[python] pandas: filter rows of DataFrame with operator chaining

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I've found to filter rows is via normal bracket indexing

df_filtered = df[df['column'] == value]

This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?

df_filtered = df.mask(lambda x: x['column'] == value)

This question is related to python pandas dataframe

The answer is


My answer is similar to the others. If you do not want to create a new function you can use what pandas has defined for you already. Use the pipe method.

df.pipe(lambda d: d[d['column'] == value])

If you set your columns to search as indexes, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(3, size=(10, 5)),
    columns=list('ABCDE')
)

df
# Out[55]: 
#    A  B  C  D  E
# 0  0  2  2  2  2
# 1  1  1  2  0  2
# 2  0  2  0  0  2
# 3  0  2  2  0  1
# 4  0  1  1  2  0
# 5  0  0  0  1  2
# 6  1  0  1  1  1
# 7  0  0  2  0  2
# 8  2  2  2  2  2
# 9  1  2  0  2  1

df.set_index(['A', 'D']).xs([0, 2]).reset_index()
# Out[57]: 
#    A  D  B  C  E
# 0  0  2  2  2  2
# 1  0  2  1  1  0

I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

But I found that, if you wrap each condition in (... == True) and join the criteria with a pipe, the criteria are combined in an OR condition, satisfied whenever either of them is true:

df[((df.A==1) == True) | ((df.D==6) == True)]

I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

I'll add other edits to make this post more useful.

pandas.DataFrame.query
query was made for exactly this purpose. Consider the dataframe df

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 5)),
    columns=list('ABCDE')
)

df

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

Let's use query to filter all rows where D > B

df.query('D > B')

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

Which we chain

df.query('D > B').query('C > B')
# equivalent to
# df.query('D > B and C > B')
# but defeats the purpose of demonstrating chaining

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

The answer from @lodagro is great. I would extend it by generalizing the mask function as:

def mask(df, f):
  return df[f(df)]

Then you can do stuff like:

df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)

This is unappealing as it requires I assign df to a variable before being able to filter on its values.

df[df["column_name"] != 5].groupby("other_column_name")

seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.


This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don't need to download the entire repo: saving the file and doing

from where import where as W

should suffice. Then you use it like this:

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False], 
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire - or subset of a - DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

By the way: even in the case in which you are just using boolean cols,

df.loc[W['cond1']].loc[W['cond2']]

can be much more efficient than

df.loc[W['cond1'] & W['cond2']]

because it evaluates cond2 only where cond1 is True.

DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.


Just want to add a demonstration using loc to filter not only by rows but also by columns and some merits to the chained operation.

The code below can filter the rows by value.

df_filtered = df.loc[df['column'] == value]

By modifying it a bit you can filter the columns as well.

df_filtered = df.loc[df['column'] == value, ['year', 'column']]

So why do we want a chained method? The answer is that it is simple to read if you have many operations. For example,

res =  df\
    .loc[df['station']=='USA', ['TEMP', 'RF']]\
    .groupby('year')\
    .agg(np.nanmean)

Filters can be chained using a Pandas query:

df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')

Filters can also be combined in a single query:

df_filtered = df.query('a > 0 and 0 < b < 2')

You can also leverage the numpy library for logical operations. Its pretty fast.

df[np.logical_and(df['A'] == 1 ,df['B'] == 6)]

Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80]  # equivalent to df[df.A == 80] but chainable

df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]

If all you're doing is filtering, you can also omit the .loc.


pandas provides two alternatives to Wouter Overmeire's answer which do not require any overriding. One is .loc[.] with a callable, as in

df_filtered = df.loc[lambda x: x['column'] == value]

the other is .pipe(), as in

df_filtered = df.pipe(lambda x: x['column'] == value)

If you would like to apply all of the common boolean masks as well as a general purpose mask you can chuck the following in a file and then simply assign them all as follows:

pd.DataFrame = apply_masks()

Usage:

A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
A.le_mask("A", 0.7).ge_mask("B", 0.2)... (May be repeated as necessary

It's a little bit hacky but it can make things a little bit cleaner if you're continuously chopping and changing datasets according to filters. There's also a general purpose filter adapted from Daniel Velkov above in the gen_mask function which you can use with lambda functions or otherwise if desired.

File to be saved (I use masks.py):

import pandas as pd

def eq_mask(df, key, value):
    return df[df[key] == value]

def ge_mask(df, key, value):
    return df[df[key] >= value]

def gt_mask(df, key, value):
    return df[df[key] > value]

def le_mask(df, key, value):
    return df[df[key] <= value]

def lt_mask(df, key, value):
    return df[df[key] < value]

def ne_mask(df, key, value):
    return df[df[key] != value]

def gen_mask(df, f):
    return df[f(df)]

def apply_masks():

    pd.DataFrame.eq_mask = eq_mask
    pd.DataFrame.ge_mask = ge_mask
    pd.DataFrame.gt_mask = gt_mask
    pd.DataFrame.le_mask = le_mask
    pd.DataFrame.lt_mask = lt_mask
    pd.DataFrame.ne_mask = ne_mask
    pd.DataFrame.gen_mask = gen_mask

    return pd.DataFrame

if __name__ == '__main__':
    pass

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe