Introduction
Pandas is a powerful Python library for working with and analyzing data. One operation you may need to perform when working with data in Pandas is selecting columns based on their string prefix. This can be useful when you have a large DataFrame and want to focus on specific columns that share a common prefix.
In this Byte, we'll explore a few methods to achieve this, including creating a Series to select columns, using DataFrame.loc, and applying DataFrame.filter().
Select All Columns Starting with a Given String
Let's start with a simple DataFrame:
import pandas as pd

# Sample data: two columns prefixed 'item' and two prefixed 'stuff'
data = {
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12]
}
df = pd.DataFrame(data)
print(df)
Output:
   item1  item2  stuff1  stuff2
0      1      4       7      10
1      2      5       8      11
2      3      6       9      12
To select columns that start with 'item', you can use a list comprehension:
selected_columns = [column for column in df.columns if column.startswith('item')]
print(df[selected_columns])
Output:
   item1  item2
0      1      4
1      2      5
2      3      6
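Because Python's built-in str.startswith() also accepts a tuple of prefixes, the same comprehension extends to several prefixes at once. The following is a minimal sketch along those lines; the `prefixes` name and the extra `other` column are introduced here purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'other': [10, 11, 12],  # hypothetical extra column that should not match
})

# str.startswith() accepts a tuple and returns True if any prefix matches
prefixes = ('item', 'stuff')
selected_columns = [column for column in df.columns if column.startswith(prefixes)]
print(selected_columns)
```

This assumes every column label is a string. A non-string label, such as an integer, has no startswith method and would need to be converted with str() first.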
Creating a Series to Select Columns
Another approach to select columns based on their string prefix is to create a Series object from the DataFrame columns, and then use the str.startswith() method. This method returns a boolean Series where a True value indicates that the column name starts with the specified string.
# Label the Series with the column names so .loc can align the mask with df.columns
selected_columns = pd.Series(df.columns, index=df.columns).str.startswith('item')
print(df.loc[:, selected_columns])
Output:
   item1  item2
0      1      4
1      2      5
2      3      6
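To see what the mask itself looks like, the sketch below prints it for the sample DataFrame. One detail worth knowing: when .loc receives a boolean Series, pandas aligns it on the Series index, so the mask here is built with index=df.columns. That choice is an assumption of this sketch; converting the mask to a plain array with .to_numpy() is another option.

```python
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12],
})

# Boolean Series labelled by column name, so .loc can align it with df.columns
mask = pd.Series(df.columns, index=df.columns).str.startswith('item')
print(mask.tolist())

# Equivalent selection with a plain NumPy boolean array (no alignment involved)
subset = df.loc[:, mask.to_numpy()]
print(list(subset.columns))
```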
Using DataFrame.loc to Select Columns
The DataFrame.loc indexer is primarily label-based, but it can also be used with a boolean array. The older ix indexer was deprecated because it had a number of problems and has since been removed (in pandas 1.0), so .loc is the one to use. Note that .loc will raise a KeyError when the requested labels are not found.
Consider the following example:
selected_columns = df.columns[df.columns.str.startswith('item')]
print(df.loc[:, selected_columns])
Output:
   item1  item2
0      1      4
1      2      5
2      3      6
Here, df.columns.str.startswith('item') produces a boolean array that is True for columns starting with 'item'. Indexing df.columns with that array gives the matching labels, which are then passed to the .loc indexer to select the corresponding columns from the DataFrame. This approach works directly on the column Index and avoids building an intermediate Python list or Series, although for a typical number of columns any speed difference is likely to be small.
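The KeyError behavior mentioned above applies to explicit labels, not to boolean masks. The sketch below illustrates the difference on the sample DataFrame; the 'item9' label and the 'zzz' prefix are hypothetical values chosen because they do not exist in the data:

```python
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12],
})

# Selecting by explicit labels fails if any label is missing
try:
    df.loc[:, ['item1', 'item9']]  # 'item9' is a hypothetical missing column
    raised = False
except KeyError:
    raised = True
print(raised)

# A prefix that matches nothing does not raise; it yields a DataFrame with no columns
no_match = df.loc[:, df.columns.str.startswith('zzz')]
print(no_match.shape)
```

In practice this means prefix-based selection is safe to run even when no column matches, but it is worth checking whether the result is empty before using it.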
Applying DataFrame.filter() for Column Selection
The filter() function in pandas DataFrame provides a flexible and concise way to select columns based on their names. It's especially useful when dealing with large datasets with many columns.
The filter() function allows us to select columns based on their labels. We can use the like parameter to specify a substring that the column names must contain. However, since like matches the substring anywhere in the name, selecting columns by prefix calls for the regex parameter.
Here's an example:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'product_name': ['apple', 'banana', 'cherry', 'date'],
    'product_price': [1.2, 0.5, 0.75, 1.3],
    'product_weight': [150, 120, 50, 60]
})

# Select columns that start with 'product'
df_filtered = df.filter(regex='^product')
print(df_filtered)
This will output:
   product_id product_name  product_price  product_weight
0         101        apple           1.20             150
1         102       banana           0.50             120
2         103       cherry           0.75              50
3         104         date           1.30              60
In the above code, the ^ symbol is a regular expression anchor that matches the start of a string. Therefore, '^product' will match all column names that start with 'product'.
Note: The filter() function returns a new DataFrame and does not modify the original DataFrame in place, so the original keeps all of its columns.
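To make the difference between like and regex concrete, here is a small sketch. The 'new_item' column is a hypothetical name added to show a label that contains the substring without starting with it, and the use of re.escape is an extra precaution assumed here for prefixes that might contain regex metacharacters:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2],
    'new_item': [3, 4],  # contains 'item' but does not start with it
    'stuff1': [5, 6],
})

# like= keeps every column whose name contains the substring
contains = df.filter(like='item')
print(list(contains.columns))

# regex= with a ^ anchor keeps only names that start with the prefix;
# re.escape guards against prefixes containing regex metacharacters
prefix = 'item'
starts = df.filter(regex='^' + re.escape(prefix))
print(list(starts.columns))
```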
Conclusion
In this Byte, we explored different ways to select columns in a pandas DataFrame based on a string prefix. We learned how to use a list comprehension, how to create a Series and use it to select columns, how to use the DataFrame.loc indexer, and how to apply the DataFrame.filter() function. Each of these methods has its own advantages and use cases, and the right choice depends on the specific requirements of your data analysis task.