Apply multiple functions to multiple groupby columns

Question

The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys   In  563   grouped  D   agg   result1    np sum                               result2    np mean             Out 563          result2   result1 A                       bar -0 579846 -1 739537 foo -0 280588 -1 402938   However  this only works on a Series groupby object  And when a dict is similarly passed to a groupby DataFrame  it expects the keys to be the column names that the function will be applied to   What I want to do is apply multiple functions to several columns  but certain columns will be operated on multiple times   Also  some functions will depend on other columns in the groupby object  like sumif functions   My current solution is to go column by column  and doing something like the code above  using lambdas for functions that depend on other rows  But this is taking a long time   I think it takes a long time to iterate through a groupby object   I ll have to change it so that I iterate through the whole groupby object in a single run  but I m wondering if there s a built in way in pandas to do this somewhat cleanly   For example  I ve tried something like   grouped agg   C sum    lambda x  x  C   sum                  C std   lambda x  x  C   std                  D sum    lambda x  x  D   sum                   D sumifC3   lambda x  x  D   x  C      3  sum           but as expected I get a KeyError  since the keys have to be a column if agg is called from a DataFrame    Is there any built in way to do what I d like to do  or a possibility that this functionality may be added  or will I just need to iterate through the groupby manually   Thanks

User · Accepted Answer

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you desire to work with two separate columns at the same time I would suggest using the apply method which implicitly passes a DataFrame to the applied function. Let's use a similar dataframe as the one from above

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'], 
                         'b':'mean', 
                         'c':'sum', 
                         'd': max_min})

              a                   b         c             d
            sum       max      mean       sum Max minus Min
group                                                      
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401

Using `apply` and returning a Series

Now, if you had multiple columns that needed to interact together then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply the entire group as a DataFrame gets passed into the function.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

         a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

    def f_mi(x):
        d = []
        d.append(x['a'].sum())
        d.append(x['a'].max())
        d.append(x['b'].mean())
        d.append((x['c'] * x['d']).sum())
        return pd.Series(d, index=[['a', 'a', 'b', 'c_d'], 
                                   ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

User · Answer

For the first part you can pass a dict of column names for keys and a list of functions for the values   In  28   df Out 28             A         B         C         D         E  GRP 0  0 395670  0 219560  0 600644  0 613445  0 242893    0 1  0 323911  0 464584  0 107215  0 204072  0 927325    0 2  0 321358  0 076037  0 166946  0 439661  0 914612    1 3  0 133466  0 447946  0 014815  0 130781  0 268290    1  In  26   f     A    sum   mean     B    prod     In  27   df groupby  GRP   agg f  Out 27               A                   B           sum      mean      prod GRP 0    0 719580  0 359790  0 102004 1    0 454824  0 227412  0 034060   UPDATE 1   Because the aggregate function works on Series  references to the other column names are lost   To get around this  you can reference the full dataframe and index it using the group indices within the lambda function   Here s a hacky workaround   In  67   f     A    sum   mean     B    prod     D   lambda g  df loc g index  E sum     In  69   df groupby  GRP   agg f  Out 69               A                   B         D           sum      mean      prod   lt lambda gt  GRP 0    0 719580  0 359790  0 102004  1 170219 1    0 454824  0 227412  0 034060  1 182901   Here  the resultant  D  column is made up of the summed  E  values   UPDATE 2   Here s a method that I think will do everything you ask   First make a custom lambda function   Below  g references the group   When aggregating  g will be a Series   Passing g index to df ix   selects the current group from df   I then test if column C is less than 0 5   The returned boolean series is passed to g   which selects only those rows meeting the criteria   In  95   cust   lambda g  g df loc g index   C    lt  0 5  sum    In  96   f     A    sum   mean     B    prod     D     my name   cust    In  97   df groupby  GRP   agg f  Out 97               A                   B         D           sum      mean      prod   my name GRP 0    0 719580  0 359790  0 102004  0 204072 1    0 454824  0 227412  0 034060  0 570441

User · Answer

This is a twist on  exans  answer that uses Named Aggregations  It s the same but with argument unpacking which allows you to still pass in a dictionary to the agg function  The named aggs are a nice feature  but at first glance might seem hard to write programmatically since they use keywords  but it s actually simple with argument keyword unpacking  animals   pd DataFrame   kind     cat    dog    cat    dog                              height    9 1  6 0  9 5  34 0                             weight    7 9  7 5  9 9  198 0      agg dict          quot min height quot   pd NamedAgg column  height   aggfunc  min         quot max height quot   pd NamedAgg column  height   aggfunc  max         quot average weight quot   pd NamedAgg column  weight   aggfunc np mean     animals groupby  quot kind quot   agg   agg dict   The Result       min height  max height  average weight kind                                         cat          9 1         9 5            8 90 dog          6 0        34 0          102 75

User · Answer

New in version 0 25 0   To support column-specific aggregation with control over the output column names  pandas accepts the special syntax in GroupBy agg    known as    named aggregation     where   The keywords are the output column names  The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column  Pandas provides the pandas NamedAgg namedtuple with the fields   column    aggfunc   to make it clearer what the arguments are  As usual  the aggregation can be a callable or a string alias        In  79   animals   pd DataFrame   kind     cat    dog    cat    dog                                          height    9 1  6 0  9 5  34 0                                         weight    7 9  7 5  9 9  198 0                       In  80   animals     Out 80          kind  height  weight     0  cat     9 1     7 9     1  dog     6 0     7 5     2  cat     9 5     9 9     3  dog    34 0   198 0      In  81   animals groupby  kind   agg                   min height pd NamedAgg column  height   aggfunc  min                     max height pd NamedAgg column  height   aggfunc  max                     average weight pd NamedAgg column  weight   aggfunc np mean                                    Out 81              min height  max height  average weight     kind                                             cat          9 1         9 5            8 90     dog          6 0        34 0          102 75   pandas NamedAgg is just a namedtuple  Plain tuples are allowed as well       In  82   animals groupby  kind   agg                   min height   height    min                     max height   height    max                     average weight   weight   np mean                                    Out 82              min height  max height  average weight     kind                                             cat          9 1         9 5            8 90     dog          6 0        34 0          102 75   Additional keyword arguments are not passed through to the aggregation functions  Only pairs of  column  aggfunc  should be passed as   kwargs  If your aggregation functions requires additional arguments  partially apply them with functools partial     Named aggregation is also valid for Series groupby aggregations  In this case there   s no column selection  so the values are just the functions       In  84   animals groupby  kind   height agg                   min height  min                    max height  max                                    Out 84              min height  max height     kind                             cat          9 1         9 5     dog          6 0        34 0

User · Answer

Pandas  gt   0 25 0  named aggregations  Since pandas version 0 25 0 or higher  we are moving away from the dictionary based aggregation and renaming  and moving towards named aggregations which accepts a tuple  Now we can simultaneously aggregate   rename to a more informative column name   Example   df   pd DataFrame np random rand 4 4   columns list  abcd    df  group      0  0  1  1             a         b         c         d  group 0  0 521279  0 914988  0 054057  0 125668      0 1  0 426058  0 828890  0 784093  0 446211      0 2  0 363136  0 843751  0 184967  0 467351      1 3  0 241012  0 470053  0 358018  0 525032      1   Apply GroupBy agg with named aggregation   df groupby  group   agg               a sum   a    sum                 a mean   a    mean                 b mean   b    mean                 c sum   c    sum                 d range   d   lambda x  x max   - x min                 a sum    a mean    b mean     c sum   d range group                                                   0      0 947337  0 473668  0 871939  0 838150  0 320543 1      0 604149  0 302074  0 656902  0 542985  0 057681

User · Answer

As an alternative  mostly on aesthetics  to Ted Petrou s answer  I found I preferred a slightly more compact listing  Please don t consider accepting it  it s just a much-more-detailed comment on Ted s answer  plus code data  Python pandas is not my first best  but I found this to read well   df groupby  group        apply lambda x  pd Series          a sum          x  a   sum           a max          x  a   max           b mean         x  b   mean           c d prodsum     x  c     x  d    sum                     a sum     a max    b mean  c d prodsum group                                            0      0 530559  0 374540  0 553354     0 488525 1      1 433558  0 832443  0 460206     0 053313   I find it more reminiscent of dplyr pipes and data table chained commands  Not to say they re better  just more familiar to me   I certainly recognize the power and  for many  the preference of using more formalized def functions for these types of operations  This is just an alternative  not necessarily better      I generated data in the same manner as Ted  I ll add a seed for reproducibility   import numpy as np np random seed 42  df   pd DataFrame np random rand 4 4   columns list  abcd    df  group      0  0  1  1  df            a         b         c         d  group 0  0 374540  0 950714  0 731994  0 598658      0 1  0 156019  0 155995  0 058084  0 866176      0 2  0 601115  0 708073  0 020584  0 969910      1 3  0 832443  0 212339  0 181825  0 183405      1

User · Answer

Ted s answer is amazing  I ended up using a smaller version of that in case anyone is interested  Useful when you are looking for one aggregation that depends on values from multiple columns   create a dataframe  df pd DataFrame   a    1 2 3 4 5 6    b    1 1 0 1 1 0    c     x   x   y   y   z   z          a  b  c 0  1  1  x 1  2  1  x 2  3  0  y 3  4  1  y 4  5  1  z 5  6  0  z   grouping and aggregating with apply  using multiple columns   df groupby  c   apply lambda x  x  a    x  a   gt 1   amp   x  b    1   mean     c x    2 0 y    4 0 z    5 0   grouping and aggregating with aggregate  using multiple columns   I like this approach since I can still use aggregate  Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups   It seems obvious now  but as long as you don t select the column of interest directly after the groupby  you will have access to all the columns of the dataframe from within your aggregation function   only access to the selected column  df groupby  c    a   aggregate lambda x  x x gt 1  mean      access to all columns since selection is after all the magic  df groupby  c   aggregate lambda x  x  x  a   gt 1   amp   x  b    1   mean     a     or similarly  df groupby  c   aggregate lambda x  x  a    x  a   gt 1   amp   x  b    1   mean      I hope this helps

[python] Apply multiple functions to multiple groupby columns

Using `apply` and returning a Series

Examples related to python

Examples related to group-by

Examples related to aggregate-functions

Examples related to pandas

[python] Apply multiple functions to multiple groupby columns

Using apply and returning a Series

Examples related to python

Examples related to group-by

Examples related to aggregate-functions

Examples related to pandas

Using `apply` and returning a Series