How do you filter pandas dataframes by multiple columns?

Question

To filter a dataframe (df) by a single column, if we consider data with males and females, we might write:

    males = df[df['Gender'] == 'Male']

Question 1: But what if the data spanned multiple years and I wanted to see only males for 2014?

In other languages I might do something like:

    if A = "Male" and if B = "2014" then

(except I want to do this and get a subset of the original dataframe in a new dataframe object).

Question 2: How do I do this in a loop, and create a dataframe object for each unique set of year and gender (i.e. a df for 2013-Male, 2013-Female, 2014-Male, and 2014-Female)?

    for y in year:
        for g in gender:
            df = ...

User · Answer

For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:

df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]

where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).
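As a minimal sketch of this pattern, assuming a small made-up dataframe and a hypothetical predicate `f` (neither comes from the question):

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Year": [2013, 2013, 2014, 2014],
})


def f(gender, year):
    # Hypothetical predicate: any Python logic that returns True or False.
    return gender == "Male" and year >= 2014


# apply(axis=1) passes each row as a Series; *x unpacks it into f's arguments.
mask = df[["Gender", "Year"]].apply(lambda x: f(*x), axis=1)
filtered = df[mask]
print(filtered)
```

Because `f` is called once per row in Python, this is usually slower than a vectorized boolean mask, so it is best kept for conditions that cannot be written with column-wise operators.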

User · Answer

In case somebody wonders which is the faster way to filter (the accepted answer or the one from @redreamality):

    import pandas as pd
    import numpy as np

    length = 100_000
    df = pd.DataFrame()
    df['Year'] = np.random.randint(1950, 2019, size=length)
    df['Gender'] = np.random.choice(['Male', 'Female'], length)

    %timeit df.query('Gender=="Male" & Year=="2014"')
    %timeit df[(df['Gender']=='Male') & (df['Year']==2014)]

Results for 100,000 rows:

    6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results for 10,000,000 rows:

    326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the results depend on the size and the data. On my laptop, query() gets faster after 500k rows. Further, the string search in Year=="2014" has an unnecessary overhead (Year==2014 is faster).
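The `%timeit` lines above only work inside IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard `timeit` module, might look like this. The fixed seed, the repeat count of 20, and the helper names `with_query` and `with_mask` are assumptions for illustration, and the query compares `Year` to the integer 2014:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
length = 100_000
df = pd.DataFrame({
    "Year": rng.integers(1950, 2019, size=length),
    "Gender": rng.choice(["Male", "Female"], size=length),
})


def with_query():
    return df.query('Gender == "Male" & Year == 2014')


def with_mask():
    return df[(df["Gender"] == "Male") & (df["Year"] == 2014)]


# Both approaches should select exactly the same rows.
assert with_query().equals(with_mask())

for fn in (with_query, with_mask):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Timings will differ between machines and pandas versions, so it is worth running this on data of your own size before choosing one form.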

User · Answer

When using the & operator, don't forget to wrap the sub-statements in ():

    males = df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]

To store your dataframes in a dict using a for loop:

    from collections import defaultdict
    dic = {}
    for g in ['Male', 'Female']:
        dic[g] = defaultdict(dict)
        for y in [2013, 2014]:
            dic[g][y] = df[(df['Gender'] == g) & (df['Year'] == y)]  # store the DataFrames in a dict of dicts

EDIT:

A demo for your getDF:

    def getDF(dic, gender, year):
        return dic[gender][year]

    print(getDF(dic, 'Male', 2014))
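The nested loop above can also be expressed with `groupby`, which yields one sub-dataframe per combination that actually occurs in the data. This is an alternative technique, not the answer's own method, and the sample data (including the `Score` column) is made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; "Score" is an invented payload column.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Year": [2013, 2013, 2014, 2014, 2014],
    "Score": [1, 2, 3, 4, 5],
})

# Iterating a groupby on two columns yields ((year, gender), sub_dataframe) pairs.
frames = {key: group for key, group in df.groupby(["Year", "Gender"])}

print(sorted(frames))
print(frames[(2014, "Male")])
```

Each value in `frames` is a regular dataframe keyed by a `(year, gender)` tuple, so there is no need to list the years and genders by hand.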

User · Answer

You can filter by multiple columns (more than two) by using the np.logical_and operator to replace & (or np.logical_or to replace |).

Here's an example function that does the job if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:

    def filter_df(df, filter_values):
        """Filter df by matching targets for multiple columns.

        Args:
            df (pd.DataFrame): dataframe
            filter_values (None or dict): Dictionary of the form:
                    `{<field>: <target_values_list>}`
                used to filter columns data.
        """
        import numpy as np
        if filter_values is None or not filter_values:
            return df
        return df[
            np.logical_and.reduce([
                df[column].isin(target_values)
                for column, target_values in filter_values.items()
            ])
        ]

Usage:

    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})

    filter_df(df, {
        'a': [1, 2, 3],
        'b': [1, 2, 4]
    })
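To make the difference between the two reductions concrete, here is a small sketch that reuses the dataframe and target values from the usage example above; the variable names `conditions`, `both`, and `either` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 4]})

# One boolean Series per column condition.
conditions = [
    df["a"].isin([1, 2, 3]),
    df["b"].isin([1, 2, 4]),
]

# Rows that satisfy every condition (equivalent to chaining with &).
both = df[np.logical_and.reduce(conditions)]

# Rows that satisfy at least one condition (equivalent to chaining with |).
either = df[np.logical_or.reduce(conditions)]

print(both)
print(either)
```

Here `both` keeps the rows with index 0 and 1, while `either` keeps all four rows.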

User · Answer

Starting from pandas 0.13, this is the most efficient way:

    df.query('Gender=="Male" & Year=="2014"')
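When the values to match live in Python variables, `query` can reference them with the `@` prefix instead of building the string by hand. A minimal sketch, assuming made-up sample data and an integer `Year` column:

```python
import pandas as pd

# Hypothetical sample data; Year is stored as integers here.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Year": [2013, 2014, 2014],
})

gender = "Male"
year = 2014

# @name refers to a Python variable from the surrounding scope.
males_2014 = df.query("Gender == @gender and Year == @year")
print(males_2014)
```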

User · Answer

You can create your own filter function using query in pandas. Here the df results are filtered by all the kwargs parameters. Don't forget to add some validators (kwargs filtering) to get a filter function for your own df.

    def filter(df, **kwargs):
        query_list = []
        for key in kwargs.keys():
            query_list.append(f'{key}=="{kwargs[key]}"')
        query = ' & '.join(query_list)
        return df.query(query)
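One possible shape for the validation mentioned above is sketched below. The name `filter_by` and the choice to raise `KeyError` are assumptions, not part of the answer, and the sketch assumes column names that are valid Python identifiers. It also uses `repr()` so that numbers are compared as numbers and not as strings:

```python
import pandas as pd


def filter_by(df, **kwargs):
    """Hypothetical variant that checks column names before querying."""
    unknown = [key for key in kwargs if key not in df.columns]
    if unknown:
        raise KeyError(f"Unknown columns: {unknown}")
    if not kwargs:
        return df
    # repr() quotes strings but leaves numbers unquoted, so types are preserved.
    query = " & ".join(f"{key} == {value!r}" for key, value in kwargs.items())
    return df.query(query)


# Hypothetical sample data for the usage example.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Year": [2013, 2014, 2014],
})
result = filter_by(df, Gender="Male", Year=2014)
print(result)
```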
