[python] pandas: best way to select all columns whose names start with X

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select the rows where any column whose name starts with "foo." has the value 1. Is there a better way to do it than this:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

Something like this:

df2= df[df.STARTS_WITH_FOO == 1]

The answer should print out a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]


Answers:


Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you the foo columns for all rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr's answer, which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that rows 3 and 4 do not contain a 1 and won't be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1 - as your desired output suggests - you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
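Note that (df == 1) tests every column, so a 1 in a non-foo column such as nas.foo would also select a row. In the example data this makes no difference, but if the check should be limited to the foo columns, a minimal sketch could look like the following (the names foo_cols and row_mask are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# Boolean mask over the column labels: True for names starting with "foo."
foo_cols = df.columns.str.startswith('foo.')

# Compare only the foo columns with 1, then reduce to one flag per row
row_mask = (df.loc[:, foo_cols] == 1).any(axis=1)

# Keep all columns, but only the rows flagged above
result = df.loc[row_mask]
print(result.index.tolist())  # [0, 1, 2, 5]
```

Here row 2 is kept because foo.fox is 1, not because nas.foo happens to be 1 as well.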

The simplest way is to use str directly on the column names; there is no need for pd.Series:

df.loc[:,df.columns.str.startswith("foo")]



You can use a regex here to select the columns starting with "foo":

df.filter(regex='^foo')

If you only need the string foo to appear somewhere in the column name, then

df.filter(regex='foo')

would be appropriate.

For the next step, you can use

df[(df.filter(regex='^foo') == 1).any(axis=1)]

to keep the rows where at least one of the foo columns has the value 1.
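It is worth being careful with * in these patterns: in a regular expression it applies only to the preceding character, so a pattern such as '^foo*' means "fo followed by zero or more o" and also matches names like fob. The sketch below uses a small made-up frame (the column names are hypothetical) to show how that differs from the plain prefix pattern '^foo':

```python
import pandas as pd

# Hypothetical columns chosen to expose the difference between the patterns
demo = pd.DataFrame({'fob': [1], 'foo.a': [2], 'xfoo': [3]})

# '^foo*' only requires the name to start with "fo"
loose = demo.filter(regex='^foo*')

# '^foo' requires the full prefix "foo"
strict = demo.filter(regex='^foo')

print(list(loose.columns))   # ['fob', 'foo.a']
print(list(strict.columns))  # ['foo.a']
```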


Now that pandas' indexes support string operations, arguably the simplest and best way to select columns beginning with 'foo' is just:

df.loc[:, df.columns.str.startswith('foo')]

Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0
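If the prefix comes from a variable rather than being typed into the pattern by hand, the dot needs escaping, otherwise it matches any character. One way to handle this, sketched here with a made-up frame and an illustrative prefix variable, is re.escape from the standard library:

```python
import re

import pandas as pd

# Hypothetical columns: only the first one really starts with "foo."
demo = pd.DataFrame({'foo.aa': [1], 'fooxbar': [2], 'bar.foo.': [3]})
prefix = 'foo.'

# Unescaped, the dot matches any character, so "fooxbar" slips through
unescaped = demo.filter(regex='^' + prefix, axis=1)

# re.escape turns the prefix into a literal pattern ("foo\.")
escaped = demo.filter(regex='^' + re.escape(prefix), axis=1)

print(list(unescaped.columns))  # ['foo.aa', 'fooxbar']
print(list(escaped.columns))    # ['foo.aa']
```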

My solution, which may be slower:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0
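One thing to watch with the concat approach: a row that has a 1 in more than one foo column is picked up once per matching column, so it appears several times in the result. That does not happen with the example data, but the made-up frame below shows the effect and one possible way to drop the repeats (the variable names are illustrative):

```python
import pandas as pd

# Hypothetical data: row 0 has a 1 in both foo columns
demo = pd.DataFrame({'foo.a': [1, 0, 1],
                     'foo.b': [1, 0, 0],
                     'bar': [7, 8, 9]})

# Same idea as above: one filtered frame per foo column, stacked together
parts = pd.concat([demo[demo[c] == 1] for c in demo.columns if c.startswith('foo')])
print(parts.index.tolist())  # [0, 2, 0]

# Keep the first occurrence of each index label, then restore row order
deduped = parts[~parts.index.duplicated()].sort_index()
print(deduped.index.tolist())  # [0, 2]
```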

Based on @EdChum's answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This is helpful when not all the columns you want to select start with foo. It selects every column whose name contains the substring foo, wherever in the name it appears.

In essence, I replaced .startswith() with .contains().
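Keep in mind that .str.contains() treats its argument as a regular expression by default, which matters when the substring contains characters such as a dot. A small sketch with made-up column names, calling .str directly on the column index and passing regex=False for a literal match:

```python
import pandas as pd

# Hypothetical columns: "axb" only matches if the dot is read as a wildcard
demo = pd.DataFrame({'a.b': [1], 'axb': [2], 'c': [3]})

# Default behaviour: the pattern is a regex, so "." matches any character
as_regex = demo.loc[:, demo.columns.str.contains('a.b')]

# regex=False: the pattern is matched as a plain substring
as_literal = demo.loc[:, demo.columns.str.contains('a.b', regex=False)]

print(list(as_regex.columns))    # ['a.b', 'axb']
print(list(as_literal.columns))  # ['a.b']
```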


In my case I needed to match a list of prefixes:

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]
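Passing a tuple to .str.startswith() may not be accepted by every pandas version. If it fails on yours, a fallback that relies only on Python's built-in str.startswith (which does accept a tuple) could look like this. The frame dc and its column names are invented for the example:

```python
import pandas as pd

# Hypothetical frame standing in for dc
dc = pd.DataFrame({'production.cpu': [0.5],
                   'test.cpu': [0.1],
                   'staging.cpu': [0.3],
                   'development.mem': [2.0]})

cols_to_scale = ['production', 'test', 'development']

# Built-in str.startswith accepts a tuple of candidate prefixes
prefixes = tuple(cols_to_scale)
selected = [c for c in dc.columns if c.startswith(prefixes)]

subset = dc[selected]
print(list(subset.columns))  # ['production.cpu', 'test.cpu', 'development.mem']
```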
