Efficient way to apply multiple filters to pandas DataFrame or Series

Question

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.

The filters should be additive (aka each one applied should narrow results).

I'm currently using reindex(), but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). So, this could be really inefficient when filtering a big Series or DataFrame.

I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though, so still trying to wrap my head around everything.

TL;DR

I want to take a dictionary of the following form and apply each operation to a given Series object and return a 'filtered' Series object.

    relops = {'>=': [1], '<=': [1]}

Long Example

I'll start with an example of what I have currently and just filtering a single Series object. Below is the function I'm currently using:

    def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.items():
            op_func = ops[op]
            for val in vals:
                filtered = op_func(series, val)
                series = series.reindex(series[filtered])
        return series

The user provides a dictionary with the operations they want to perform:

    >>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
    >>> print(df)
       col1  col2
    0     0    10
    1     1    11
    2     2    12

    >>> from operator import le, ge
    >>> ops = {'>=': ge, '<=': le}
    >>> apply_relops(df['col1'], {'>=': [1]})
    col1
    1       1
    2       2
    Name: col1
    >>> apply_relops(df['col1'], relops={'>=': [1], '<=': [1]})
    col1
    1       1
    Name: col1

Again, the 'problem' with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps. Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for the Series can be easily expanded to a DataFrame.
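One way to avoid the repeated reindexing described above is to fold every comparison into a single boolean mask and index the Series only once at the end. The following is a minimal sketch under that assumption, not a definitive implementation; the `OPS` table and its extra `>` and `<` entries are illustrative additions that are not part of the question.

```python
import operator

import numpy as np
import pandas as pd

# Hypothetical mapping from the user's operator strings to functions.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def apply_relops(series, relops):
    """Return the entries of `series` that satisfy every comparison in `relops`."""
    # Start with an all-True mask and narrow it with each comparison.
    mask = np.ones(len(series), dtype=bool)
    for op, vals in relops.items():
        for val in vals:
            mask &= OPS[op](series, val).to_numpy()
    # Index once at the end, so only one filtered object is created.
    return series[mask]


df = pd.DataFrame({"col1": [0, 1, 2], "col2": [10, 11, 12]})
result = apply_relops(df["col1"], {">=": [1], "<=": [1]})
print(result)
```

Each comparison still allocates a temporary boolean array, but that is one byte per row rather than a copy of the data itself.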

User · Answer

We can also select rows based on values of a column that are not in a list or any other iterable. We create a boolean mask just like before, but now we negate it by placing ~ in front.

For example:

    values = [1, 0]
    df[~df.col1.isin(values)]

User · Answer

As of pandas 0.22, comparison methods are available, such as:

gt (greater than)
lt (less than)
eq (equal to)
ne (not equal to)
ge (greater than or equal to)

and many more. These methods return a boolean array. Let's see how we can use them:

    # sample data
    df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14, 15]})

    # get values from col1 greater than or equal to 1
    df.loc[df['col1'].ge(1), 'col1']

    1    1
    2    2
    3    3
    4    4
    5    5

    # where col1 values are between 0 and 2
    df.loc[df['col1'].between(0, 2)]

       col1  col2
    0     0    10
    1     1    11
    2     2    12

    # where col1 > 1
    df.loc[df['col1'].gt(1)]

       col1  col2
    2     2    12
    3     3    13
    4     4    14
    5     5    15
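Because these comparisons are ordinary methods, they can be looked up by name at run time, which fits the question's requirement that filters are chosen by the user. The sketch below assumes a hypothetical `filters` list of (column, method name, value) tuples; that format is an illustration, not something the answer defines.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14, 15]})

# Hypothetical run-time filter spec: (column, comparison method name, value).
filters = [("col1", "ge", 1), ("col2", "lt", 14)]

# Start from an all-True mask aligned with the frame's index.
mask = pd.Series(True, index=df.index)
for col, method, value in filters:
    # getattr(df[col], "ge") is the bound method df[col].ge
    mask = mask & getattr(df[col], method)(value)

filtered = df[mask]
print(filtered)
```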

User · Answer

If you want to check any/all of multiple columns for a value, you can do:

    df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]
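To make the difference between the "any" and "all" variants concrete, here is a small sketch on made-up data. The `games` frame and its rows are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical fixtures table for illustration.
games = pd.DataFrame(
    {
        "HomeTeam": ["Fulham", "Arsenal", "Fulham", "Everton"],
        "AwayTeam": ["Chelsea", "Fulham", "Fulham", "Leeds"],
    }
)

# Element-wise comparison over both columns gives a boolean frame.
is_fulham = games[["HomeTeam", "AwayTeam"]] == "Fulham"

either = games[is_fulham.any(axis=1)]  # Fulham in at least one of the columns
both = games[is_fulham.all(axis=1)]    # Fulham in every checked column
```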

User · Answer

Why not do this?

    import pandas as pd

    def filt_spec(df, col, val, op):
        import operator
        ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt,
               'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
        return df[ops[op](df[col], val)]

    # attach the function to DataFrame so it can be called as a method
    pd.DataFrame.filt_spec = filt_spec

Demo:

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [5, 4, 3, 2, 1]})
    df.filt_spec('a', 2, 'ge')

Result:

       a  b
    1  2  4
    2  3  3
    3  4  2
    4  5  1

You can see that column 'a' has been filtered where a >= 2.

This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.
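If patching a method onto `pandas.DataFrame` feels too invasive, the same function can be chained through `DataFrame.pipe`, which passes the frame as the first argument. This is a sketch of that alternative, not the answer's own approach; the module-level `OPS` name is an assumption made for the example.

```python
import operator

import pandas as pd

# Hypothetical module-level operator table (the answer builds it inside the function).
OPS = {
    "eq": operator.eq, "neq": operator.ne,
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
}


def filt_spec(df, col, val, op):
    """Keep the rows of `df` where `df[col] <op> val` holds."""
    return df[OPS[op](df[col], val)]


df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1]})

# pipe(func, *args) calls func(df, *args), so filters chain without monkeypatching.
result = df.pipe(filt_spec, "a", 2, "ge").pipe(filt_spec, "b", 2, "gt")
print(result)
```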

User · Answer

Chaining conditions creates long lines, which are discouraged by PEP 8. Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is:

    import numpy as np
    import functools

    def conjunction(*conditions):
        return functools.reduce(np.logical_and, conditions)

    c_1 = data.col1 == True
    c_2 = data.col2 < 64
    c_3 = data.col3 != 4

    data_filtered = data[conjunction(c_1, c_2, c_3)]

np.logical_and operates element-wise and is fast, but it does not take more than two input arguments, which is handled by functools.reduce.

Note that this still has some redundancies:

a) short-circuiting does not happen on a global level
b) each of the individual conditions runs on the whole initial data

Still, I expect this to be efficient enough for many applications, and it is very readable.

You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:

    def disjunction(*conditions):
        return functools.reduce(np.logical_or, conditions)

    data_filtered = data[disjunction(c_1, c_2, c_3)]
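As a variation on the functools.reduce approach, NumPy ufuncs have their own `reduce` method, so `np.logical_and.reduce` can combine any number of masks in one call. The sketch below uses a made-up `data` frame (the answer does not show one) and checks that both approaches agree.

```python
import functools

import numpy as np
import pandas as pd

# Hypothetical data; the answer only names the columns.
data = pd.DataFrame(
    {
        "col1": [True, True, False, True],
        "col2": [10, 70, 20, 30],
        "col3": [4, 1, 2, 3],
    }
)

conditions = [data.col1 == True, data.col2 < 64, data.col3 != 4]

# Stack the masks as plain arrays and AND them together along the first axis.
mask = np.logical_and.reduce([c.to_numpy() for c in conditions])
data_filtered = data[mask]

# Same result as pairwise reduction with functools.reduce.
pairwise = functools.reduce(np.logical_and, conditions)
same = bool((pairwise.to_numpy() == mask).all())
```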

User · Answer

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

    In [11]: df.loc[df['col1'] >= 1, 'col1']
    Out[11]:
    1    1
    2    2
    Name: col1

    In [12]: df[df['col1'] >= 1]
    Out[12]:
       col1  col2
    1     1    11
    2     2    12

    In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
    Out[13]:
       col1  col2
    1     1    11

If you want to write helper functions for this, consider something along these lines:

    In [14]: def b(x, col, op, n):
                 return op(x[col], n)

    In [15]: def f(x, *b):
                 return x[(np.logical_and(*b))]

    In [16]: b1 = b(df, 'col1', ge, 1)

    In [17]: b2 = b(df, 'col1', le, 1)

    In [18]: f(df, b1, b2)
    Out[18]:
       col1  col2
    1     1    11

Update: pandas 0.13 has a query method for these kinds of use cases. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, as it uses numexpr behind the scenes):

    In [21]: df.query('col1 <= 1 & 1 <= col1')
    Out[21]:
       col1  col2
    1     1    11
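Since `query` takes a string, a run-time filter specification can be turned into an expression before it is evaluated. The following is a rough sketch, assuming a hypothetical `filters` list of (column, operator, value) tuples and an `ALLOWED_OPS` whitelist; neither comes from the answer, and user-supplied strings should be validated before they reach `query`.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 2], "col2": [10, 11, 12]})

# Hypothetical run-time spec: (column, operator string, value).
filters = [("col1", ">=", 1), ("col2", "<", 12)]

# Only accept known comparison operators and identifier-like column names.
ALLOWED_OPS = {">", ">=", "<", "<=", "==", "!="}

clauses = []
for col, op, value in filters:
    if op not in ALLOWED_OPS or not col.isidentifier():
        raise ValueError(f"unsupported filter: {col!r} {op!r}")
    clauses.append(f"({col} {op} {value!r})")

expr = " and ".join(clauses)
result = df.query(expr)
print(expr)
print(result)
```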

User · Answer

Simplest of All Solutions:

Use:

    filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another Example: to filter the dataframe for values belonging to Feb-2018, use the below code:

    filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
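The parentheses around each comparison matter: in Python, `&` binds more tightly than `>=` and `<=`, so leaving them out changes how the expression is parsed. A small sketch on made-up data illustrates the failure and the working form.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 2, 3, 4, 5, 6]})

try:
    # Without parentheses this parses as: df["col1"] >= (1 & df["col1"]) <= 5,
    # a chained comparison that needs the truth value of a whole Series.
    df[df["col1"] >= 1 & df["col1"] <= 5]
    raised = False
except ValueError:
    raised = True

# With parentheses, each comparison yields a mask and & combines them.
filtered_df = df[(df["col1"] >= 1) & (df["col1"] <= 5)]
```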
