Friday, August 18, 2023

How to Select Columns in Pandas Based on a String Prefix


Introduction

Pandas is a powerful Python library for working with and analyzing data. One operation you may need to perform when working with data in Pandas is selecting columns based on their string prefix. This can be useful when you have a large DataFrame and want to focus on specific columns that share a common prefix.

In this Byte, we'll explore a few methods to achieve this, including creating a Series to select columns and using DataFrame.loc.

Select All Columns Starting with a Given String

Let's start with a simple DataFrame:

import pandas as pd

data = {
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12]
}
df = pd.DataFrame(data)
print(df)

Output:

   item1  item2  stuff1  stuff2
0      1      4       7      10
1      2      5       8      11
2      3      6       9      12

To select columns that start with 'item', you can use a list comprehension:

selected_columns = [column for column in df.columns if column.startswith('item')]
print(df[selected_columns])

Output:

   item1  item2
0      1      4
1      2      5
2      3      6
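
The same list comprehension extends to several prefixes at once, because Python's str.startswith() also accepts a tuple of prefixes. The following is a minimal sketch under that assumption; the 'other' column and the prefixes tuple are hypothetical additions for illustration and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12],
    'other': [13, 14, 15],  # hypothetical extra column that matches neither prefix
})

# str.startswith() accepts a tuple and returns True if any prefix matches
prefixes = ('item', 'stuff')
multi_prefix_columns = [column for column in df.columns if column.startswith(prefixes)]

print(df[multi_prefix_columns])
```

Note that this relies on every column name being a string; a DataFrame with integer column labels would need a type check before calling startswith().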

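Creating a Series to Select Columns

Another way to select columns based on their string prefix is to create a Series object from the DataFrame columns and then use the str.startswith() method. This method returns a boolean Series where a True value indicates that the column name starts with the specified string. Because that Series is indexed by position (0, 1, 2, ...) rather than by column name, it is converted to a plain boolean array before being passed to .loc, which would otherwise try to align it with the column labels and fail.

# Build a boolean mask from the column names, then drop the positional index
selected_columns = pd.Series(df.columns).str.startswith('item').to_numpy()
print(df.loc[:, selected_columns])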

Output:

   item1  item2
0      1      4
1      2      5
2      3      6
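
A boolean Series passed to .loc is aligned on its index against the column labels. One variant of the approach above, sketched here as an illustration rather than as the article's own method, is to label the Series with the column names so that the alignment succeeds. The name labelled_mask is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12],
})

# Use the column names as both the values and the index of the Series,
# so the resulting boolean mask is indexed by column label.
labelled_mask = pd.Series(df.columns, index=df.columns).str.startswith('item')

selected = df.loc[:, labelled_mask]
print(selected)
```

Keeping the mask as a labelled Series can also make it easier to inspect which column names matched.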

Using DataFrame.loc to Select Columns

The DataFrame.loc indexer is primarily label-based, but it can also be used with a boolean array. The older ix indexer was deprecated because of a number of problems and was removed in pandas 1.0. .loc raises a KeyError when the requested labels are not found.

Consider the following example:

selected_columns = df.columns[df.columns.str.startswith('item')]
print(df.loc[:, selected_columns])

Output:

   item1  item2
0      1      4
1      2      5
2      3      6

Here, df.columns.str.startswith('item') first produces a boolean array that is True for columns starting with 'item'. Indexing df.columns with that array yields the matching column labels, and .loc then selects those columns from the DataFrame. This approach works directly on the column Index, so it avoids building a Python list or a separate Series, which keeps the code compact for wide DataFrames.
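
Since .loc accepts a boolean array directly, the intermediate label lookup is optional. The sketch below, with the hypothetical names mask, direct, and via_labels, compares the two forms on the same example DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'item1': [1, 2, 3],
    'item2': [4, 5, 6],
    'stuff1': [7, 8, 9],
    'stuff2': [10, 11, 12],
})

# Boolean mask with one entry per column
mask = df.columns.str.startswith('item')

# Pass the mask straight to .loc ...
direct = df.loc[:, mask]

# ... or convert it to column labels first, as in the example above
via_labels = df.loc[:, df.columns[mask]]

print(direct.equals(via_labels))
```

Both forms select the same columns here; which one reads better is largely a matter of taste.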

Applying DataFrame.filter() for Column Selection

The filter() method of a pandas DataFrame provides a flexible way to select columns based on their names. It's especially useful when dealing with large datasets that have many columns.

The filter() method selects columns based on their labels, not on the data they contain. We can use the like parameter to keep columns whose names contain a given substring anywhere in the name. However, if we want to select columns based on a string prefix specifically, we can use the regex parameter.

Here's an example:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'product_name': ['apple', 'banana', 'cherry', 'date'],
    'product_price': [1.2, 0.5, 0.75, 1.3],
    'product_weight': [150, 120, 50, 60]
})

# Select columns that start with 'product'
df_filtered = df.filter(regex='^product')

print(df_filtered)

This will output:

   product_id product_name  product_price  product_weight
0         101        apple           1.20             150
1         102       banana           0.50             120
2         103       cherry           0.75              50
3         104         date           1.30              60

In the above code, the ^ symbol is a regular expression anchor that matches the start of a string. Therefore, '^product' will match all column names that start with 'product'.
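
To make the difference between like and regex concrete, here is a small sketch. The 'old_product_id' column is a hypothetical addition chosen because it contains 'product' without starting with it, and the use of re.escape() is a suggestion for prefixes that may contain regex metacharacters, not something the original example requires.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'product_id': [101, 102],
    'old_product_id': [1, 2],  # hypothetical: contains 'product' but does not start with it
    'price': [1.2, 0.5],
})

# like= keeps any column whose name contains the substring
by_like = df.filter(like='product')

# regex= with a ^ anchor keeps only names that start with the prefix
by_regex = df.filter(regex='^product')

# Escaping the prefix guards against characters such as '.' or '(' being
# interpreted as regex syntax
prefix = 'product'
by_escaped = df.filter(regex='^' + re.escape(prefix))

print(list(by_like.columns))
print(list(by_regex.columns))
```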

Note: The filter() method returns a new DataFrame rather than modifying the original in place. If you plan to modify the result, calling .copy() on it makes its independence from the original DataFrame explicit.
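
The note above can be checked with a short sketch. It assumes an explicit .copy() on the filtered result, which is a defensive choice made for this illustration; the name filtered is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': [101, 102, 103],
    'product_price': [1.2, 0.5, 0.75],
    'stock': [10, 20, 30],
})

# Take an explicit copy so later edits cannot reach the original data
filtered = df.filter(regex='^product').copy()
filtered.loc[0, 'product_id'] = 999

print(filtered.loc[0, 'product_id'])
print(df.loc[0, 'product_id'])
```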

Conclusion

In this Byte, we explored different ways to select columns in a pandas DataFrame based on a string prefix. We learned how to use a list comprehension, how to create a Series and use it to select columns, how to use the DataFrame.loc indexer, and how to apply the DataFrame.filter() method. Each of these methods has its own advantages and use cases, and the right choice depends on the specific requirements of your data analysis task.
