How to select rows from a DataFrame based on column values

Question

How can I select rows from a DataFrame based on values in some column in Pandas?

In SQL, I would use:

    SELECT * FROM table WHERE column_name = some_value

I tried to look at the Pandas documentation, but I did not immediately find the answer.

User · Answer

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val')

Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prepending an @.

exclude = ('red', 'orange')
df.query('color not in @exclude')
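
The `exclude` snippet above has no data behind it, so here is a minimal self-contained sketch of the same idea. The `color` and `qty` columns and their values are made up for illustration; they are not from the documentation being quoted.

```python
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({"color": ["red", "blue", "orange", "green"],
                   "qty": [1, 2, 3, 4]})

exclude = ("red", "orange")

# @exclude refers to the Python variable above, not to a column called "exclude".
kept = df.query("color not in @exclude")
print(kept)
```

Without the `@`, `query` would look for a column named `exclude` instead of the local variable.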

User · Answer

More flexibility using .query with Pandas >= 0.25.0

August 2019 updated answer

Since Pandas >= 0.25.0 we can use the query method to filter dataframes with Pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) - see GitHub.

Example dataframe:

    df = pd.DataFrame({'Sender email': ['ex@example.com', "reply@shop.com", "buy@shop.com"]})

         Sender email
    0  ex@example.com
    1  reply@shop.com
    2    buy@shop.com

Using .query with the method str.endswith:

    df.query('`Sender email`.str.endswith("@shop.com")')

Output

         Sender email
    1  reply@shop.com
    2    buy@shop.com

We can also use local variables by prefixing them with an @ in our query:

    domain = 'shop.com'
    df.query('`Sender email`.str.endswith(@domain)')

Output

         Sender email
    1  reply@shop.com
    2    buy@shop.com
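
For anyone who wants to paste and run the backtick example, here is a small self-contained sketch. Passing `engine="python"` is a cautious choice on my part, on the assumption that string methods inside `query` are best evaluated by the Python engine; the answer above does not pass it.

```python
import pandas as pd

df = pd.DataFrame(
    {"Sender email": ["ex@example.com", "reply@shop.com", "buy@shop.com"]}
)

domain = "@shop.com"

# Backticks quote a column name containing a space; @domain is the local variable.
# engine="python" is an assumption of this sketch, not part of the original answer.
shop_rows = df.query("`Sender email`.str.endswith(@domain)", engine="python")
print(shop_rows)
```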

User · Answer

tl;dr

The Pandas equivalent to

    select * from table where column_name = some_value

is

    table[table.column_name == some_value]

Multiple conditions:

    table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

or

    table.query('column_name == some_value | column_name2 == some_value2')

Code example

    import pandas as pd

    # Create data set
    d = {'foo': [100, 111, 222],
         'bar': [333, 444, 555]}
    df = pd.DataFrame(d)

    # Full dataframe:
    df

    # Shows:
    #    bar   foo
    # 0  333   100
    # 1  444   111
    # 2  555   222

    # Output only the row(s) in df where foo is 222:
    df[df.foo == 222]

    # Shows:
    #    bar  foo
    # 2  555  222

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:

    df[(df.foo == 222) | (df.bar == 444)]
    #    bar  foo
    # 1  444  111
    # 2  555  222

But at that point I would recommend using the query function, since it's less verbose and yields the same result:

    df.query('foo == 222 | bar == 444')
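
A quick way to convince yourself that the two spellings agree is to compare their results directly. This is only a sanity-check sketch on the same toy data; I spell the condition with `or` inside `query`, which pandas also accepts.

```python
import pandas as pd

df = pd.DataFrame({"foo": [100, 111, 222], "bar": [333, 444, 555]})

# Boolean-mask form: each condition needs its own parentheses.
by_mask = df[(df["foo"] == 222) | (df["bar"] == 444)]

# query form of the same filter.
by_query = df.query("foo == 222 or bar == 444")

same = by_mask.equals(by_query)
print(same)
```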

User · Answer

Here is a simple example

    from pandas import DataFrame

    # Create data set
    d = {'Revenue': [100, 111, 222],
         'Cost': [333, 444, 555]}
    df = DataFrame(d)

    # mask = Return True when the value in column "Revenue" is equal to 111
    mask = df['Revenue'] == 111

    print(mask)

    # Result:
    # 0    False
    # 1     True
    # 2    False
    # Name: Revenue, dtype: bool

    # Select * FROM df WHERE Revenue = 111
    df[mask]

    # Result:
    #    Cost    Revenue
    # 1  444     111
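
One thing the mask style makes easy is naming each condition and combining them afterwards. The sketch below reuses the Revenue/Cost data; the thresholds and the names `high_revenue` and `low_cost` are made up for illustration.

```python
from pandas import DataFrame

df = DataFrame({"Revenue": [100, 111, 222], "Cost": [333, 444, 555]})

# Each mask is a boolean Series aligned with df's index.
high_revenue = df["Revenue"] > 100   # hypothetical threshold
low_cost = df["Cost"] < 500          # hypothetical threshold

# Combine the masks with & (and), | (or), ~ (not).
selected = df[high_revenue & low_cost]
print(selected)
```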

User · Answer

Faster results can be achieved using numpy.where.

For example, with unutbu's setup -

    In [76]: df.iloc[np.where(df.A.values=='foo')]
    Out[76]:
         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

Timing comparisons:

    In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
    1000 loops, best of 3: 380 µs per loop

    In [69]: %timeit df.loc[df['A'] == 'foo']
    1000 loops, best of 3: 745 µs per loop

    In [71]: %timeit df.loc[df['A'].isin(['foo'])]
    1000 loops, best of 3: 562 µs per loop

    In [72]: %timeit df[df.A=='foo']
    1000 loops, best of 3: 796 µs per loop

    In [74]: %timeit df.query('(A=="foo")')  # slowest
    1000 loops, best of 3: 1.71 ms per loop
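
To make the mechanics explicit: `np.where` with a single condition returns a tuple of position arrays, and `df.iloc` then selects by those positions. The sketch below takes the first element of that tuple and checks that the result matches the plain boolean mask. It makes no timing claims, since those depend on data size and library versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "C": np.arange(8)})

# Integer positions of the rows where A equals 'foo'.
positions = np.where(df["A"].to_numpy() == "foo")[0]

by_position = df.iloc[positions]
by_mask = df[df["A"] == "foo"]
print(positions)
```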

User · Answer

In newer versions of Pandas, inspired by the documentation (Viewing data):

    df[df["column_name"] == some_value]    # Scalar, True/False, ...
    df[df["column_name"] == "some_value"]  # String

Combine multiple conditions by putting each clause in parentheses, (), and combining them with & and | (and/or). Like this:

    df[(df["column_name"] == "some_value1") & (df["column_name2"] == "some_value2")]

Other filters

    pandas.notna(df["column_name"]) == True                  # Not NaN
    df["column_name"].str.contains("text")                   # Search for "text"
    df["column_name"].str.lower().str.contains("text")       # Search for "text", after converting to lowercase
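
One caveat with the string filters: if the column contains NaN, `str.contains` returns a missing value for those rows, and such a mask cannot be used directly for indexing. A minimal sketch, assuming a made-up `name` column, that uses the `na` and `case` parameters to handle this:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value.
df = pd.DataFrame({"name": ["Some Text", "other", np.nan, "TEXT here"]})

# na=False treats missing values as non-matches;
# case=False makes the search case-insensitive (no need for .str.lower()).
mask = df["name"].str.contains("text", case=False, na=False)
matches = df[mask]
print(matches)
```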

User · Answer

For selecting only specific columns out of multiple columns for a given value in Pandas:

    select col_name1, col_name2 from table where column_name = some_value

Options:

    df.loc[df['column_name'] == some_value, ['col_name1', 'col_name2']]

or

    df.query('column_name == @some_value')[['col_name1', 'col_name2']]
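
Here is a runnable sketch of both options on the small frame used elsewhere in this thread; which columns to keep (`B` and `D`) is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split(),
                   "C": list(range(8)),
                   "D": [2 * i for i in range(8)]})

# Row filter and column selection in a single .loc call.
via_loc = df.loc[df["A"] == "foo", ["B", "D"]]

# Row filter with query, then column selection.
via_query = df.query("A == 'foo'")[["B", "D"]]
print(via_loc)
```

The single `.loc` call does the row and column selection in one step, which also avoids chained indexing if you later want to assign to the selection.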

User · Answer

To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with the specified column having a particular value. E.g.

    import pandas as pd
    df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                       'B': 'one one two three two two one three'.split()})
    print("Original dataframe:")
    print(df)

    b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
    # NOTE: the final drop is to remove the extra index column returned by groupby object
    print('Sub dataframe where B is two:')
    print(b_is_two_dataframe)

Running this gives:

    Original dataframe:
         A      B
    0  foo    one
    1  bar    one
    2  foo    two
    3  bar  three
    4  foo    two
    5  bar    two
    6  foo    one
    7  foo  three
    Sub dataframe where B is two:
         A    B
    0  foo  two
    1  foo  two
    2  bar  two
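
A slightly shorter variant of the same idea, as a sketch: `reset_index(drop=True)` discards the old index directly, so the separate `drop('index', axis=1)` step is not needed. The comparison with a boolean mask is only there to show that both pick the same rows.

```python
import pandas as pd

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split()})

# drop=True throws away the old index instead of turning it into a column.
via_group = df.groupby("B").get_group("two").reset_index(drop=True)

# The same rows picked with a boolean mask.
via_mask = df[df["B"] == "two"].reset_index(drop=True)
print(via_group)
```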

User · Answer

There are several ways to select rows from a Pandas dataframe:

1. Boolean indexing (df[df['col'] == value])
2. Positional indexing (df.iloc[...])
3. Label indexing (df.xs(...))
4. df.query(...) API

Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column 'A' == 'foo'.

(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)

Setup

The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

    import pandas as pd, numpy as np

    df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                       'B': 'one one two three two two one three'.split(),
                       'C': np.arange(8), 'D': np.arange(8) * 2})

1. Boolean indexing

Boolean indexing requires finding the truth value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.

    mask = df['A'] == 'foo'

We can then use this mask to slice or index the data frame:

    df[mask]

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.

2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

    mask = df['A'] == 'foo'
    pos = np.flatnonzero(mask)
    df.iloc[pos]

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit.

    df.set_index('A', append=True, drop=False).xs('foo', level=1)

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant and intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

    df.query('A == "foo"')

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1

Use the underlying NumPy array and forgo the overhead of creating another pd.Series:

    mask = df['A'].values == 'foo'

I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask:

    %timeit mask = df['A'].values == 'foo'
    %timeit mask = df['A'] == 'foo'

    5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the NumPy array is about 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we'll look at the timing for slicing with one mask versus the other.

    mask = df['A'].values == 'foo'
    %timeit df[mask]
    mask = df['A'] == 'foo'
    %timeit df[mask]

    219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.

mask alternative 2

We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the dtypes when doing so!

Instead of df[mask] we will do this:

    pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. This requires the astype(df.dtypes) and kills any potential performance gains.

    %timeit df[mask]
    %timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

    216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

    np.random.seed([3,1415])
    d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

    d1

       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    6  8  7  6  4  7
    7  6  2  6  6  5
    8  2  8  7  5  8
    9  4  7  6  1  5

    %%timeit
    mask = d1['A'].values == 7
    d1[mask]

    179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

    %%timeit
    mask = d1['A'].values == 7
    pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

    87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.

mask alternative 3

@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

    mask = df['A'].isin(['foo'])
    df[mask]

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d:

    mask = np.in1d(df['A'].values, ['foo'])
    df[mask]

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

Timing

I'll include other concepts mentioned in other posts as well for reference. Code below.

Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

    res.div(res.min())

                             10        30        100       300       1000      3000      10000     30000
    mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
    mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
    mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
    mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
    query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
    xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
    mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
    mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

    res.T.plot(loglog=True)

Functions

    def mask_standard(df):
        mask = df['A'] == 'foo'
        return df[mask]

    def mask_standard_loc(df):
        mask = df['A'] == 'foo'
        return df.loc[mask]

    def mask_with_values(df):
        mask = df['A'].values == 'foo'
        return df[mask]

    def mask_with_values_loc(df):
        mask = df['A'].values == 'foo'
        return df.loc[mask]

    def query(df):
        return df.query('A == "foo"')

    def xs_label(df):
        return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

    def mask_with_isin(df):
        mask = df['A'].isin(['foo'])
        return df[mask]

    def mask_with_in1d(df):
        mask = np.in1d(df['A'].values, ['foo'])
        return df[mask]

Testing

    from timeit import timeit

    res = pd.DataFrame(
        index=[
            'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
            'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
        ],
        columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        dtype=float
    )

    for j in res.columns:
        d = pd.concat([df] * j, ignore_index=True)
        for i in res.index:
            stmt = '{}(d)'.format(i)
            setp = 'from __main__ import d, {}'.format(i)
            res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing

Looking at the special case when we have a single non-object dtype for the entire data frame. Code below.

    spec.div(spec.min())

                         10        30        100       300       1000      3000      10000     30000
    mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
    mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
    reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn't worth it past a few hundred rows.

    spec.T.plot(loglog=True)

Functions

    np.random.seed([3,1415])
    d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

    def mask_with_values(df):
        mask = df['A'].values == 'foo'
        return df[mask]

    def mask_with_in1d(df):
        mask = np.in1d(df['A'].values, ['foo'])
        return df[mask]

    def reconstruct(df):
        v = df.values
        mask = np.in1d(df['A'].values, ['foo'])
        return pd.DataFrame(v[mask], df.index[mask], df.columns)

    spec = pd.DataFrame(
        index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
        columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        dtype=float
    )

Testing

    for j in spec.columns:
        d = pd.concat([df] * j, ignore_index=True)
        for i in spec.index:
            stmt = '{}(d)'.format(i)
            setp = 'from __main__ import d, {}'.format(i)
            spec.at[i, j] = timeit(stmt, setp, number=50)
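
The dtype caveat from "mask alternative 2" is easy to miss, so here is a small sketch that makes it visible. The three-row frame is a cut-down, made-up version of the example data, and `to_numpy()` is used as the current spelling of `.values` for a single column.

```python
import pandas as pd

# Mixed-type frame: one string column, one integer column.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [0, 1, 2]})

mask = df["A"].to_numpy() == "foo"

# df.values on a mixed-type frame is a single object-dtype array,
# so the rebuilt frame loses the integer dtype of column C.
rebuilt = pd.DataFrame(df.values[mask], df.index[mask], df.columns)
print(rebuilt["C"].dtype)

# astype(df.dtypes) restores the original dtypes, at extra cost.
fixed = rebuilt.astype(df.dtypes)
print(fixed["C"].dtype)
```

This is why the reconstruction trick only pays off for frames with a single non-object dtype, where no `astype` step is needed.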

User · Answer

To select rows whose column value equals a scalar, some_value, use ==:

    df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

    df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

    df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

    df['column_name'] >= A & df['column_name'] <= B

is parsed as

    df['column_name'] >= (A & df['column_name']) <= B

which results in a "Truth value of a Series is ambiguous" error.

To select rows whose column value does not equal some_value, use !=:

    df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

    df.loc[~df['column_name'].isin(some_values)]

For example,

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                       'B': 'one one two three two two one three'.split(),
                       'C': np.arange(8), 'D': np.arange(8) * 2})
    print(df)
    #      A      B  C   D
    # 0  foo    one  0   0
    # 1  bar    one  1   2
    # 2  foo    two  2   4
    # 3  bar  three  3   6
    # 4  foo    two  4   8
    # 5  bar    two  5  10
    # 6  foo    one  6  12
    # 7  foo  three  7  14

    print(df.loc[df['A'] == 'foo'])

yields

         A      B  C   D
    0  foo    one  0   0
    2  foo    two  2   4
    4  foo    two  4   8
    6  foo    one  6  12
    7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

    print(df.loc[df['B'].isin(['one','three'])])

yields

         A      B  C   D
    0  foo    one  0   0
    1  bar    one  1   2
    3  bar  three  3   6
    6  foo    one  6  12
    7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

    df = df.set_index(['B'])
    print(df.loc['one'])

yields

           A  C   D
    B
    one  foo  0   0
    one  bar  1   2
    one  foo  6  12

or, to include multiple values from the index, use df.index.isin:

    df.loc[df.index.isin(['one','two'])]

yields

           A  C   D
    B
    one  foo  0   0
    one  bar  1   2
    two  foo  2   4
    two  foo  4   8
    two  bar  5  10
    one  foo  6  12
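
The parentheses point is easy to verify. The sketch below runs the correctly parenthesized filter and then the unparenthesized one, catching the error so the script keeps going. The bounds `A = 2` and `B = 5` and the single column `C` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"C": range(8)})
A, B = 2, 5  # arbitrary bounds for this sketch

# Correct: each comparison is wrapped in its own parentheses.
between = df.loc[(df["C"] >= A) & (df["C"] <= B)]

# Without parentheses, & is evaluated first and the chained comparison
# then asks for the truth value of a whole Series.
error_message = None
try:
    df.loc[df["C"] >= A & df["C"] <= B]
except ValueError as exc:
    error_message = str(exc)

print(between["C"].tolist())
print(error_message)
```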

User · Answer

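You can also use .apply:

    df.apply(lambda row: row[df['B'].isin(['one','three'])])

Despite the argument name, with the default axis=0 apply passes each column (not each row) to the function as a Series; the boolean mask then keeps the matching entries of every column.

The output is:

         A      B  C   D
    0  foo    one  0   0
    1  bar    one  1   2
    3  bar  three  3   6
    6  foo    one  6  12
    7  foo  three  7  14

The result is the same as the approach mentioned by @unutbu:

    df[df['B'].isin(['one','three'])]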
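
If you do want a genuinely row-wise test, one option is to pass `axis=1` so the function receives one row at a time and returns a single True/False. This is a sketch of that variant, not the code from the answer above, and it is usually slower than `isin` because it calls a Python function for every row.

```python
import pandas as pd

df = pd.DataFrame({"A": "foo bar foo bar foo bar foo foo".split(),
                   "B": "one one two three two two one three".split()})

# axis=1: the lambda sees one row (as a Series) per call and returns a bool.
row_mask = df.apply(lambda row: row["B"] in ("one", "three"), axis=1)

via_apply = df[row_mask]
via_isin = df[df["B"].isin(["one", "three"])]
print(via_apply)
```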
