[python] Efficient way to apply multiple filters to pandas DataFrame or Series

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.

The filters should be additive (aka each one applied should narrow results).

I'm currently using reindex() but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). So, this could be really inefficient when filtering a big Series or DataFrame.

I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though so still trying to wrap my head around everything.

TL;DR

I want to take a dictionary of the following form and apply each operation to a given Series object and return a 'filtered' Series object.

relops = {'>=': [1], '<=': [1]}

Long Example

I'll start with an example of what I have currently and just filtering a single Series object. Below is the function I'm currently using:

   def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.iteritems():
            op_func = ops[op]
            for val in vals:
                filtered = op_func(series, val)
                series = series.reindex(series[filtered])
        return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops ={'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

Again, the 'problem' with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.

Also, I would like to expand this so that the dictionary passed in can include the columns to operator on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for the Series can be easily expanded to a DataFrame.

This question is related to python algorithm pandas

The answer is


Why not do this?

def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec

Demo:

df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')

Result:

   a  b
 1  2  4
 2  3  3
 3  4  2
 4  5  1

You can see that column 'a' has been filtered where a >=2.

This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.


e can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front.

For example

list = [1, 0]
df[df.col1.isin(list)]

Since pandas 0.22 update, comparison options are available like:

  • gt (greater than)
  • lt (lesser than)
  • eq (equals to)
  • ne (not equals to)
  • ge (greater than or equals to)

and many more. These functions return boolean array. Let's see how we can use them:

# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})

# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']

1    1
2    2
3    3
4    4
5    5

# where co11 values is better 0 and 2
df.loc[df['col1'].between(0,2)]

 col1 col2
0   0   10
1   1   11
2   2   12

# where col1 > 1
df.loc[df['col1'].gt(1)]

 col1 col2
2   2   12
3   3   13
4   4   14
5   5   15

Simplest of All Solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another Example, To filter the dataframe for values belonging to Feb-2018, use the below code

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]

Chaining conditions creates long lines, which are discouraged by pep8. Using the .query method forces to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is

import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c1,c2,c3)]

np.logical operates on and is fast, but does not take more than two arguments, which is handled by functools.reduce.

Note that this still has some redundancies: a) shortcutting does not happen on a global level b) Each of the individual conditions runs on the whole initial data. Still, I expect this to be efficient enough for many applications and it is very readable.

You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:

import numpy as np
import functools
def disjunction(*conditions):
    return functools.reduce(np.logical_or, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[disjunction(c1,c2,c3)]

If you want to check any/all of multiple columns for a value, you can do:

df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float