[python] Apply multiple functions to multiple groupby columns

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.

What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.

For example, I've tried something like

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum()},
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

Thanks

This question is related to python group-by aggregate-functions pandas

The answer is


New in version 0.25.0.

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
    In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
       ....:                         'height': [9.1, 6.0, 9.5, 34.0],
       ....:                         'weight': [7.9, 7.5, 9.9, 198.0]})
       ....: 

    In [80]: animals
    Out[80]: 
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0

    In [81]: animals.groupby("kind").agg(
       ....:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
       ....:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
       ....:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
       ....: )
       ....: 
    Out[81]: 
          min_height  max_height  average_weight
    kind                                        
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

    In [82]: animals.groupby("kind").agg(
       ....:     min_height=('height', 'min'),
       ....:     max_height=('height', 'max'),
       ....:     average_weight=('weight', np.mean),
       ....: )
       ....: 
    Out[82]: 
          min_height  max_height  average_weight
    kind                                        
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions requires additional arguments, partially apply them with functools.partial().

Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

    In [84]: animals.groupby("kind").height.agg(
       ....:     min_height='min',
       ....:     max_height='max',
       ....: )
       ....: 
    Out[84]: 
          min_height  max_height
    kind                        
    cat          9.1         9.5
    dog          6.0        34.0

This is a twist on 'exans' answer that uses Named Aggregations. It's the same but with argument unpacking which allows you to still pass in a dictionary to the agg function.

The named aggs are a nice feature, but at first glance might seem hard to write programmatically since they use keywords, but it's actually simple with argument/keyword unpacking.

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                         'height': [9.1, 6.0, 9.5, 34.0],
                         'weight': [7.9, 7.5, 9.9, 198.0]})
 
agg_dict = {
    "min_height": pd.NamedAgg(column='height', aggfunc='min'),
    "max_height": pd.NamedAgg(column='height', aggfunc='max'),
    "average_weight": pd.NamedAgg(column='weight', aggfunc=np.mean)
}

animals.groupby("kind").agg(**agg_dict)

The Result

      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

As an alternative (mostly on aesthetics) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it, it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best, but I found this to read well:

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'a_sum'       : x['a'].sum(),
      'a_max'       : x['a'].max(),
      'b_mean'      : x['b'].mean(),
      'c_d_prodsum' : (x['c'] * x['d']).sum()
  })
)

          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.530559  0.374540  0.553354     0.488525
1      1.433558  0.832443  0.460206     0.053313

I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power and, for many, the preference of using more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)


I generated data in the same manner as Ted, I'll add a seed for reproducibility.

import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.374540  0.950714  0.731994  0.598658      0
1  0.156019  0.155995  0.058084  0.866176      0
2  0.601115  0.708073  0.020584  0.969910      1
3  0.832443  0.212339  0.181825  0.183405      1

Pandas >= 0.25.0, named aggregations

Since pandas version 0.25.0 or higher, we are moving away from the dictionary based aggregation and renaming, and moving towards named aggregations which accepts a tuple. Now we can simultaneously aggregate + rename to a more informative column name:

Example:

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

          a         b         c         d  group
0  0.521279  0.914988  0.054057  0.125668      0
1  0.426058  0.828890  0.784093  0.446211      0
2  0.363136  0.843751  0.184967  0.467351      1
3  0.241012  0.470053  0.358018  0.525032      1

Apply GroupBy.agg with named aggregation:

df.groupby('group').agg(
             a_sum=('a', 'sum'),
             a_mean=('a', 'mean'),
             b_mean=('b', 'mean'),
             c_sum=('c', 'sum'),
             d_range=('d', lambda x: x.max() - x.min())
)

          a_sum    a_mean    b_mean     c_sum   d_range
group                                                  
0      0.947337  0.473668  0.871939  0.838150  0.320543
1      0.604149  0.302074  0.656902  0.542985  0.057681

For the first part you can pass a dict of column names for keys and a list of functions for the values:

In [28]: df
Out[28]:
          A         B         C         D         E  GRP
0  0.395670  0.219560  0.600644  0.613445  0.242893    0
1  0.323911  0.464584  0.107215  0.204072  0.927325    0
2  0.321358  0.076037  0.166946  0.439661  0.914612    1
3  0.133466  0.447946  0.014815  0.130781  0.268290    1

In [26]: f = {'A':['sum','mean'], 'B':['prod']}

In [27]: df.groupby('GRP').agg(f)
Out[27]:
            A                   B
          sum      mean      prod
GRP
0    0.719580  0.359790  0.102004
1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here's a hacky workaround:

In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

In [69]: df.groupby('GRP').agg(f)
Out[69]:
            A                   B         D
          sum      mean      prod  <lambda>
GRP
0    0.719580  0.359790  0.102004  1.170219
1    0.454824  0.227412  0.034060  1.182901

Here, the resultant 'D' column is made up of the summed 'E' values.

UPDATE 2:

Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.ix[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[] which selects only those rows meeting the criteria.

In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

In [97]: df.groupby('GRP').agg(f)
Out[97]:
            A                   B         D
          sum      mean      prod   my name
GRP
0    0.719580  0.359790  0.102004  0.204072
1    0.454824  0.227412  0.034060  0.570441

Ted's answer is amazing. I ended up using a smaller version of that in case anyone is interested. Useful when you are looking for one aggregation that depends on values from multiple columns:

create a dataframe

df=pd.DataFrame({'a': [1,2,3,4,5,6], 'b': [1,1,0,1,1,0], 'c': ['x','x','y','y','z','z']})


   a  b  c
0  1  1  x
1  2  1  x
2  3  0  y
3  4  1  y
4  5  1  z
5  6  0  z

grouping and aggregating with apply (using multiple columns)

df.groupby('c').apply(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())

c
x    2.0
y    4.0
z    5.0

grouping and aggregating with aggregate (using multiple columns)

I like this approach since I can still use aggregate. Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups.

It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.

only access to the selected column

df.groupby('c')['a'].aggregate(lambda x: x[x>1].mean())

access to all columns since selection is after all the magic

df.groupby('c').aggregate(lambda x: x[(x['a']>1) & (x['b']==1)].mean())['a']

or similarly

df.groupby('c').aggregate(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())

I hope this helps.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to group-by

SELECT list is not in GROUP BY clause and contains nonaggregated column .... incompatible with sql_mode=only_full_group_by Count unique values using pandas groupby Pandas group-by and sum Count unique values with pandas per groups Group dataframe and get sum AND count? Error related to only_full_group_by when executing a query in MySql Pandas sum by groupby, but exclude certain columns Using DISTINCT along with GROUP BY in SQL Server Python Pandas : group by in group by and average? How do I create a new column from the output of pandas groupby().sum()?

Examples related to aggregate-functions

Spark SQL: apply aggregate functions to a list of columns GROUP BY without aggregate function GROUP BY + CASE statement must appear in the GROUP BY clause or be used in an aggregate function Naming returned columns in Pandas aggregate function? Concatenate multiple result rows of one column into one, group by another column How to include "zero" / "0" results in COUNT aggregate? Apply multiple functions to multiple groupby columns Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause Optimal way to concatenate/aggregate strings

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float