pandas: filter rows of DataFrame with operator chaining

Question

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc.), but the only way I've found to filter rows is via normal bracket indexing:

    df_filtered = df[df['column'] == value]

This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?

    df_filtered = df.mask(lambda x: x['column'] == value)

User · Answer

The answer from @lodagro is great. I would extend it by generalizing the mask function as:

    def mask(df, f):
        return df[f(df)]

Then you can do stuff like:

    df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)
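Note that pandas already defines its own DataFrame.mask (it replaces values where a condition holds), so attaching the helper under that name shadows the built-in. A minimal sketch of the same idea without patching the class, assuming a hypothetical helper name filter_rows and passing it through pipe:

```python
import pandas as pd

def filter_rows(df, f):
    """Return the rows of df for which the boolean Series f(df) is True."""
    return df[f(df)]

# Integer column labels, to mirror the x[0] / x[1] usage above.
df = pd.DataFrame({0: [-1, -2, 3, -4], 1: [5, -6, 7, 8]})

# pipe hands the intermediate frame to filter_rows as its first argument,
# so each step filters the result of the previous one.
result = (
    df.pipe(filter_rows, lambda x: x[0] < 0)
      .pipe(filter_rows, lambda x: x[1] > 0)
)
print(result)
```

The helper stays an ordinary function, so nothing on pd.DataFrame is overwritten.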

User · Answer

Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    df.loc[lambda df: df.A == 80]  # equivalent to df[df.A == 80] but chainable

    df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]

If all you're doing is filtering, you can also omit the .loc.
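To make the behaviour easy to check by eye, here is a small sketch of the same chain on a fixed four-row frame (the data is made up for illustration). It also shows the form without .loc, since plain bracket indexing accepts a callable too:

```python
import pandas as pd

df = pd.DataFrame({"A": [90, 10, 85, 95], "B": [95, 99, 80, 100]})

# Each callable receives the frame produced by the previous step in the chain.
result = (
    df.sort_values("A")
      .loc[lambda d: d.A > 80]
      .loc[lambda d: d.B > d.A]
)

# The same first filter written without .loc.
without_loc = df.sort_values("A")[lambda d: d.A > 80]

print(result)
print(without_loc)
```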

User · Answer

You can also leverage the numpy library for logical operations. It's pretty fast.

    df[np.logical_and(df['A'] == 1, df['B'] == 6)]
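As a quick check of what this expression selects, the sketch below (with made-up data) compares it against the usual & operator and shows one way to combine more than two conditions with np.logical_and.reduce. It makes no claim about relative speed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1], "B": [6, 5, 6, 6], "C": [0, 1, 0, 1]})

with_numpy = df[np.logical_and(df["A"] == 1, df["B"] == 6)]
with_operator = df[(df["A"] == 1) & (df["B"] == 6)]

# logical_and takes exactly two inputs; reduce folds it over a list of masks.
three_conditions = df[
    np.logical_and.reduce([df["A"] == 1, df["B"] == 6, df["C"] == 1])
]

print(with_numpy.equals(with_operator))
print(three_conditions)
```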

User · Answer

If you would like to apply all of the common boolean masks as well as a general purpose mask, you can chuck the following in a file and then simply assign them all as follows:

    pd.DataFrame = apply_masks()

Usage:

    A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
    A.le_mask("A", 0.7).ge_mask("B", 0.2)  # ... may be repeated as necessary

It's a little bit hacky, but it can make things a little bit cleaner if you're continuously chopping and changing datasets according to filters. There's also a general purpose filter adapted from Daniel Velkov above in the gen_mask function, which you can use with lambda functions or otherwise if desired.

File to be saved (I use masks.py):

    import pandas as pd

    def eq_mask(df, key, value):
        return df[df[key] == value]

    def ge_mask(df, key, value):
        return df[df[key] >= value]

    def gt_mask(df, key, value):
        return df[df[key] > value]

    def le_mask(df, key, value):
        return df[df[key] <= value]

    def lt_mask(df, key, value):
        return df[df[key] < value]

    def ne_mask(df, key, value):
        return df[df[key] != value]

    def gen_mask(df, f):
        return df[f(df)]

    def apply_masks():
        # Attach each helper to DataFrame so it can be called as a method.
        pd.DataFrame.eq_mask = eq_mask
        pd.DataFrame.ge_mask = ge_mask
        pd.DataFrame.gt_mask = gt_mask
        pd.DataFrame.le_mask = le_mask
        pd.DataFrame.lt_mask = lt_mask
        pd.DataFrame.ne_mask = ne_mask
        pd.DataFrame.gen_mask = gen_mask

        return pd.DataFrame

    if __name__ == '__main__':
        pass
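If patching pd.DataFrame feels too invasive, one possible variation is to fold the six comparison helpers into a single function that takes the comparison from the standard operator module and is applied through pipe. This is only a sketch; the name where_col is hypothetical and not part of pandas or of the file above:

```python
import operator
import pandas as pd

def where_col(df, key, op, value):
    """Keep rows where op(df[key], value) holds, e.g. op=operator.le for <=."""
    return df[op(df[key], value)]

A = pd.DataFrame({"A": [0.1, 0.9, 0.5, 0.7], "B": [0.3, 0.5, 0.1, 0.2]})

# Same filter as A.le_mask("A", 0.7).ge_mask("B", 0.2), without touching DataFrame.
result = (
    A.pipe(where_col, "A", operator.le, 0.7)
     .pipe(where_col, "B", operator.ge, 0.2)
)
print(result)
```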

User · Answer

I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

I'll add other edits to make this post more useful.

pandas.DataFrame.query: query was made for exactly this purpose. Consider the dataframe df:

    import pandas as pd
    import numpy as np

    np.random.seed([3,1415])
    df = pd.DataFrame(
        np.random.randint(10, size=(10, 5)),
        columns=list('ABCDE')
    )

    df

       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    6  8  7  6  4  7
    7  6  2  6  6  5
    8  2  8  7  5  8
    9  4  7  6  1  5

Let's use query to filter all rows where D > B:

    df.query('D > B')

       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    7  6  2  6  6  5

Which we chain:

    df.query('D > B').query('C > B')
    # equivalent to
    # df.query('D > B and C > B')
    # but defeats the purpose of demonstrating chaining

       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    4  3  6  7  7  4
    5  5  3  7  5  9
    7  6  2  6  6  5
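The examples above compare columns with each other. To compare against a Python variable inside a chain, query lets you refer to it with an @ prefix. A small sketch with made-up data and a hypothetical variable name threshold:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 7, 0], "B": [2, 0, 2], "D": [3, 8, 1]})
threshold = 5

# "@threshold" is looked up in the surrounding Python scope, not among the columns.
result = df.query("D > B").query("D > @threshold")
print(result)
```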

User · Answer

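pandas provides two alternatives to Wouter Overmeire's answer which do not require any overriding. One is .loc[] with a callable, as in

    df_filtered = df.loc[lambda x: x['column'] == value]

the other is .pipe(), as in

    df_filtered = df.pipe(lambda x: x[x['column'] == value])

Note that the function passed to .pipe() must return the filtered frame itself; returning only the comparison would yield a boolean Series rather than the selected rows.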
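Both forms matter most when the column being filtered on only exists in the middle of a chain, so there is no variable to refer to. A minimal sketch, using a made-up frame and a derived column sq created by assign:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# "sq" does not exist on df, so df[df["sq"] > 4] would fail here;
# the callable receives the frame that assign just produced.
via_loc = df.assign(sq=lambda d: d.x ** 2).loc[lambda d: d["sq"] > 4]

# With pipe, the function must return the filtered frame itself.
via_pipe = df.assign(sq=lambda d: d.x ** 2).pipe(lambda d: d[d["sq"] > 4])

print(via_loc.equals(via_pipe))
```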

User · Answer

This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don't need to download the entire repo: saving the file and doing

    from where import where as W

should suffice. Then you use it like this:

    df = pd.DataFrame([[1, 2, True],
                       [3, 4, False],
                       [5, 7, True]],
                      index=range(3), columns=['a', 'b', 'c'])
    # On specific column:
    print(df.loc[W['a'] > 2])
    print(df.loc[-W['a'] == W['b']])
    print(df.loc[~W['c']])
    # On entire - or subset of a - DataFrame:
    print(df.loc[W.sum(axis=1) > 3])
    print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

    data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

By the way: even in the case in which you are just using boolean cols,

    df.loc[W['cond1']].loc[W['cond2']]

can be much more efficient than

    df.loc[W['cond1'] & W['cond2']]

because it evaluates cond2 only where cond1 is True.

DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.
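The point about chained filters evaluating the second condition on fewer rows can be illustrated without the W helper, using plain .loc callables. The sketch below uses made-up data and a hypothetical function second_condition that records how many rows it is asked to look at:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 5, 7], "b": [2, 4, 7, 1]})
rows_seen = []

def second_condition(d):
    # Record the size of the frame this condition is evaluated on.
    rows_seen.append(len(d))
    return d["b"] > 3

# The second .loc only receives the rows that survived the first filter.
result = df.loc[lambda d: d["a"] > 2].loc[second_condition]

print(result)
print(rows_seen)
```

Here the second condition is computed on three rows rather than four; whether that saving matters in practice depends on how expensive the condition is and how selective the first filter is.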

User · Answer

If you set your columns to search as indexes, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.

    import pandas as pd
    import numpy as np

    np.random.seed([3,1415])
    df = pd.DataFrame(
        np.random.randint(3, size=(10, 5)),
        columns=list('ABCDE')
    )

    df
    # Out[55]:
    #    A  B  C  D  E
    # 0  0  2  2  2  2
    # 1  1  1  2  0  2
    # 2  0  2  0  0  2
    # 3  0  2  2  0  1
    # 4  0  1  1  2  0
    # 5  0  0  0  1  2
    # 6  1  0  1  1  1
    # 7  0  0  2  0  2
    # 8  2  2  2  2  2
    # 9  1  2  0  2  1

    df.set_index(['A', 'D']).xs([0, 2]).reset_index()
    # Out[57]:
    #    A  D  B  C  E
    # 0  0  2  2  2  2
    # 1  0  2  1  1  0

User · Answer

I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

    In [96]: df
    Out[96]:
       A  B  C  D
    a  1  4  9  1
    b  4  5  0  2
    c  5  5  1  0
    d  1  3  9  6

    In [99]: df[(df.A == 1) & (df.D == 6)]
    Out[99]:
       A  B  C  D
    d  1  3  9  6

But I found that, if you wrap each condition in (... == True) and join the criteria with a pipe, the criteria are combined in an OR condition, satisfied whenever either of them is true:

    df[((df.A==1) == True) | ((df.D==6) == True)]
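Each comparison already produces a boolean Series, so comparing it to True again leaves it unchanged. What the OR form does need is the parentheses around each comparison, because | binds more tightly than ==. A short sketch on the frame shown above, including a chainable variant with a .loc callable:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 4, 5, 1], "B": [4, 5, 5, 3], "C": [9, 0, 1, 9], "D": [1, 2, 0, 6]},
    index=list("abcd"),
)

# Parenthesised comparisons joined with | give the OR of the two masks.
either = df[(df.A == 1) | (df.D == 6)]

# The "== True" wrapper selects exactly the same rows.
wrapped = df[((df.A == 1) == True) | ((df.D == 6) == True)]

# Chainable form of the same OR filter.
chained = df.loc[lambda d: (d.A == 1) | (d.D == 6)]

print(either)
```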

User · Answer

My answer is similar to the others. If you do not want to create a new function, you can use what pandas has defined for you already. Use the pipe method:

    df.pipe(lambda d: d[d['column'] == value])

User · Answer

Just want to add a demonstration using loc to filter not only by rows but also by columns, and some merits of the chained operation.

The code below can filter the rows by value:

    df_filtered = df.loc[df['column'] == value]

By modifying it a bit you can filter the columns as well:

    df_filtered = df.loc[df['column'] == value, ['year', 'column']]

So why do we want a chained method? The answer is that it is simple to read if you have many operations. For example:

    res = df\
        .loc[df['station'] == 'USA', ['TEMP', 'RF']]\
        .groupby('year')\
        .agg(np.nanmean)
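A runnable sketch of this kind of chain on a tiny made-up frame follows. It differs from the snippet above in three labelled ways: it assumes year is an ordinary column (so it is kept in the column selection for groupby to find), it uses a callable inside .loc so the filter does not depend on the outer variable, and it aggregates with the string "mean", which skips NaN values by default:

```python
import pandas as pd

df = pd.DataFrame({
    "station": ["USA", "USA", "USA", "CAN"],
    "year": [2000, 2000, 2001, 2000],
    "TEMP": [10.0, 14.0, 20.0, 5.0],
    "RF": [1.0, 3.0, 2.0, 9.0],
})

res = (
    df
    # Row filter as a callable, column selection as a list of labels.
    .loc[lambda d: d["station"] == "USA", ["year", "TEMP", "RF"]]
    .groupby("year")
    .agg("mean")
)
print(res)
```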

User · Answer

"This is unappealing as it requires I assign df to a variable before being able to filter on its values."

    df[df["column_name"] != 5].groupby("other_column_name")

seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.

User · Answer

I'm not entirely sure what you want, and your last line of code does not help either, but anyway:

"Chained" filtering is done by "chaining" the criteria in the boolean index.

    In [96]: df
    Out[96]:
       A  B  C  D
    a  1  4  9  1
    b  4  5  0  2
    c  5  5  1  0
    d  1  3  9  6

    In [99]: df[(df.A == 1) & (df.D == 6)]
    Out[99]:
       A  B  C  D
    d  1  3  9  6

If you want to chain methods, you can add your own mask method and use that one.

    In [90]: def mask(df, key, value):
       ....:     return df[df[key] == value]
       ....:

    In [92]: pandas.DataFrame.mask = mask

    In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4, 4)), index=list('abcd'), columns=list('ABCD'))

    In [95]: df.loc['d', 'A'] = df.loc['a', 'A']

    In [96]: df
    Out[96]:
       A  B  C  D
    a  1  4  9  1
    b  4  5  0  2
    c  5  5  1  0
    d  1  3  9  6

    In [97]: df.mask('A', 1)
    Out[97]:
       A  B  C  D
    a  1  4  9  1
    d  1  3  9  6

    In [98]: df.mask('A', 1).mask('D', 6)
    Out[98]:
       A  B  C  D
    d  1  3  9  6

User · Answer

Filters can be chained using a Pandas query:

    df = pd.DataFrame(np.random.randn(30, 3), columns=['a', 'b', 'c'])
    df_filtered = df.query('a > 0').query('0 < b < 2')

Filters can also be combined in a single query:

    df_filtered = df.query('a > 0 and 0 < b < 2')
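To confirm that the chained and combined forms select the same rows, here is a small sketch on fixed, made-up data instead of random numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, -1.0, 0.5, 2.0],
    "b": [1.0, 1.0, 3.0, 0.5],
    "c": [0.0, 0.0, 0.0, 0.0],
})

chained = df.query("a > 0").query("0 < b < 2")
combined = df.query("a > 0 and 0 < b < 2")

print(chained)
print(chained.equals(combined))
```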
