[python] how do you filter pandas dataframes by multiple columns

To filter a dataframe (df) by a single column, if we consider data with male and females we might:

males = df[df[Gender]=='Male']

Question 1 - But what if the data spanned multiple years and i wanted to only see males for 2014?

In other languages I might do something like:

if A = "Male" and if B = "2014" then 

(except I want to do this and get a subset of the original dataframe in a new dataframe object)

Question 2. How do I do this in a loop, and create a dataframe object for each unique sets of year and gender (i.e. a df for: 2013-Male, 2013-Female, 2014-Male, and 2014-Female

for y in year:

for g in gender:

df = .....

This question is related to python filter pandas

The answer is


You can create your own filter function using query in pandas. Here you have filtering of df results by all the kwargs parameters. Dont' forgot to add some validators(kwargs filtering) to get filter function for your own df.

def filter(df, **kwargs):
    query_list = []
    for key in kwargs.keys():
        query_list.append(f'{key}=="{kwargs[key]}"')
    query = ' & '.join(query_list)
    return df.query(query)

In case somebody wonders what is the faster way to filter (the accepted answer or the one from @redreamality):

import pandas as pd
import numpy as np

length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)

%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]

Results for 100,000 rows:

6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results for 10,000,000 rows:

326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So results depend on the size and the data. On my laptop, query() gets faster after 500k rows. Further, the string search in Year=="2014" has an unnecessary overhead (Year==2014 is faster).


For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:

df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]

where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).


Start from pandas 0.13, this is the most efficient way.

df.query('Gender=="Male" & Year=="2014" ')

You can filter by multiple columns (more than two) by using the np.logical_and operator to replace & (or np.logical_or to replace |)

Here's an example function that does the job, if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:

def filter_df(df, filter_values):
    """Filter df by matching targets for multiple columns.

    Args:
        df (pd.DataFrame): dataframe
        filter_values (None or dict): Dictionary of the form:
                `{<field>: <target_values_list>}`
            used to filter columns data.
    """
    import numpy as np
    if filter_values is None or not filter_values:
        return df
    return df[
        np.logical_and.reduce([
            df[column].isin(target_values) 
            for column, target_values in filter_values.items()
        ])
    ]

Usage:

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})

filter_df(df, {
    'a': [1, 2, 3],
    'b': [1, 2, 4]
})

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to filter

Monitoring the Full Disclosure mailinglist Pyspark: Filter dataframe based on multiple conditions How Spring Security Filter Chain works Copy filtered data to another sheet using VBA Filter object properties by key in ES6 How do I filter date range in DataTables? How do I filter an array with TypeScript in Angular 2? Filtering array of objects with lodash based on property value How to filter an array from all elements of another array How to specify "does not contain" in dplyr filter

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float