[python] Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one

import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])

>> df
   A  B  C
0  1  2  1
1  1  3  2
2  4  6  3
3  4  3  4
4  5  4  5

The original table is more complicated with more columns and rows.

I want to get the first row that fulfil some criteria. Examples:

  1. Get first row where A > 3 (returns row 2)
  2. Get first row where A > 4 AND B > 3 (returns row 4)
  3. Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

But, if there isn't any row that fulfil the specific criteria, then I want to get the first one after I just sort it descending by A (or other cases by B, C etc)

  1. Get first row where A > 6 (returns row 4 by ordering it by A desc and get the first one)

I was able to do it by iterating on the dataframe (I know that craps :P). So, I prefer a more pythonic way to solve it.

This question is related to python pandas

The answer is


you can take care of the first 3 items with slicing and head:

  1. df[df.A>=4].head(1)
  2. df[(df.A>=4)&(df.B>=3)].head(1)
  3. df[(df.A>=4)&((df.B>=3) * (df.C>=2))].head(1)

The condition in case nothing comes back you can handle with a try or an if...

try:
    output = df[df.A>=6].head(1)
    assert len(output) == 1
except: 
    output = df.sort_values('A',ascending=False).head(1)

For existing matches, use query:

df.query(' A > 3' ).head(1)
Out[33]: 
   A  B  C
2  4  6  3

df.query(' A > 4 and B > 3' ).head(1)
Out[34]: 
   A  B  C
4  5  4  5

df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]: 
   A  B  C
2  4  6  3

For the point that 'returns the value as soon as you find the first row/record that meets the requirements and NOT iterating other rows', the following code would work:

def pd_iter_func(df):
    for row in df.itertuples():
        # Define your criteria here
        if row.A > 4 and row.B > 3:
            return row

It is more efficient than Boolean Indexing when it comes to a large dataframe.

To make the function above more applicable, one can implements lambda functions:

def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
    for row in df.itertuples():
        if criteria(row):
            return row

pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)

As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.

def pd_idxmax_func(df, mask):
    return df.loc[mask.idxmax()]

pd_idxmax_func(df, (df.A > 4) & (df.B > 3))