Naming returned columns in Pandas aggregate function

Question

I m having trouble with Pandas  groupby functionality  I ve read the documentation  but I can t see to figure out how to apply aggregate functions to multiple columns and have custom names for those columns   This comes very close  but the data structure returned has nested column headings   data groupby  Country   agg            column1     foo   sum      column2     mean   np mean   std   np std       ie  I want to take the mean and std of column2  but return those columns as  mean  and  std    What am I missing

User · Accepted Answer

This will drop the outermost level from the hierarchical column index:

df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)

If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:

df.columns = ["_".join(x) for x in df.columns.ravel()]

For example:

import pandas as pd
import pandas.rpy.common as com
import numpy as np

data = com.load_data('Loblolly')
print(data.head())
#     height  age Seed
# 1     4.51    3  301
# 15   10.89    5  301
# 29   28.72   10  301
# 43   41.74   15  301
# 57   52.70   20  301

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
print(df.head())
#       age     height           
#       sum        std       mean
# Seed                           
# 301    78  22.638417  33.246667
# 303    78  23.499706  34.106667
# 305    78  23.927090  35.115000
# 307    78  22.222266  31.328333
# 309    78  23.132574  33.781667

df.columns = df.columns.droplevel(0)
print(df.head())

yields

      sum        std       mean
Seed                           
301    78  22.638417  33.246667
303    78  23.499706  34.106667
305    78  23.927090  35.115000
307    78  22.222266  31.328333
309    78  23.132574  33.781667

Alternatively, to keep the first level of the index:

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
df.columns = ["_".join(x) for x in df.columns.ravel()]

yields

      age_sum   height_std  height_mean
Seed                           
301        78    22.638417    33.246667
303        78    23.499706    34.106667
305        78    23.927090    35.115000
307        78    22.222266    31.328333
309        78    23.132574    33.781667

User · Answer

such as this kind of dataframe  there are two levels of thecolumn name    shop id  item id   date block num item cnt day                                          target               0   0       30          1            31                  we can use this code   df columns    col 0  if col -1      else col -1  for col in df columns values   result is    shop id  item id   date block num target               0   0       30          1            31

User · Answer

I agree with the OP that it seems more natural and consistent to name and define the output columns in the same place  e g  as is done with tidyverse s summarize in R   but a work-around in pandas for now is to create the new columns with desired names via assign before doing the aggregation   data assign      f data  column1        mean data  column2        std data  column2     groupby  Country   agg dict f sum  mean np mean  std np std   reset index      Using reset index turns  Country    f    mean   and  std  all into regular columns with a separate integer index

User · Answer

If you want to have a behavior similar to JMP  creating column titles that keep all info from the multi index you can use   newidx      for  n1 n2  in df columns ravel        newidx append   s- s     n1 n2   df columns newidx   It will change your dataframe from       I                       V     mean        std         first V 4200 0  25 499536   31 557133   4200 0 4300 0  25 605662   31 678046   4300 0 4400 0  26 679005   32 919996   4400 0 4500 0  26 786458   32 811633   4500 0   to      I-mean      I-std       V-first V 4200 0  25 499536   31 557133   4200 0 4300 0  25 605662   31 678046   4300 0 4400 0  26 679005   32 919996   4400 0 4500 0  26 786458   32 811633   4500 0

User · Answer

For pandas    0 25  The functionality to name returned aggregate columns has been reintroduced in the master branch and is targeted for pandas 0 25  The new syntax is  agg new col name   col name    agg func    Detailed example from the PR linked above   In  2   df   pd DataFrame   kind     cat    dog    cat    dog                                height    9 1  6 0  9 5  34 0                               weight    7 9  7 5  9 9  198 0             In  3   df Out 3     kind  height  weight 0  cat     9 1     7 9 1  dog     6 0     7 5 2  cat     9 5     9 9 3  dog    34 0   198 0  In  4   df groupby  kind   agg min height   height    min                                    max weight   weight    max    Out 4         min height  max weight kind cat          9 1         9 9 dog          6 0       198 0   It will also be possible to use multiple lambda expressions with this syntax and the two-step rename syntax I suggested earlier  below  as per this PR  Again  copying from the example in the PR   In  2   df   pd DataFrame   A     a    a     B    1  2    C    3  4     In  3   df groupby  A   agg   B    lambda x  0  lambda x  1    Out 3            B    lt lambda gt   lt lambda 1 gt  A a        0          1   and then  rename    or in one go   In  4   df groupby  A   agg b   B   lambda x  0   c   B   lambda x  1   Out 4      b  c A a  0  0     For pandas  lt  0 25  The currently accepted answer by unutbu describes are great way of doing this in pandas versions  lt   0 20  However  as of pandas 0 20  using this method raises a warning indicating that the syntax will not be available in future versions of pandas   Series      FutureWarning  using a dict on a Series for aggregation is deprecated and will be removed in a future version   DataFrames      FutureWarning  using a dict with renaming is deprecated and will be removed in a future version   According to the pandas 0 20 changelog  the recommended way of renaming columns while aggregating is as follows     Create a sample data frame df   pd DataFrame   A    1  1  1  2  2                       B   range 5                       C   range 5            SINGLE COLUMN  SERIES         Syntax soon to be deprecated df groupby  A   B agg   foo    count      Recommended replacement syntax df groupby  A   B agg   count    rename columns   count    foo            MULTI COLUMN        Syntax soon to be deprecated df groupby  A   agg   B     foo    sum     C     bar    min       Recommended replacement syntax df groupby  A   agg   B    sum    C    min    rename columns   B    foo    C    bar      As the recommended syntax is more verbose  parentheses can   be used to introduce line breaks and increase readability  df groupby  A        agg   B    sum    C    min         rename columns   B    foo    C    bar        Please see the 0 20 changelog for additional details   Update 2017-01-03 in response to  JunkMechanic s comment   With the old style dictionary syntax  it was possible to pass multiple lambda functions to  agg  since these would be renamed with the key in the passed dictionary    gt  gt  gt  df groupby  A   agg   B     min   lambda x  x min     max   lambda x  x max           B       max min A         1   2   0 2   4   3   Multiple functions can also be passed to a single column as a list    gt  gt  gt  df groupby  A   agg   B    np min  np max          B        amin amax A           1    0    2 2    3    4   However  this does not work with lambda functions  since they are anonymous and all return  lt lambda gt   which causes a name collision    gt  gt  gt  df groupby  A   agg   B    lambda x  x min    lambda x  x max    SpecificationError  Function names must be unique  found multiple named  lt lambda gt    To avoid the SpecificationError  named functions can be defined a priori instead of using lambda  Suitable function names also avoid calling  rename on the data frame afterwards  These functions can be passed with the same list syntax as above    gt  gt  gt  def my min x    gt  gt  gt      return x min     gt  gt  gt  def my max x    gt  gt  gt      return x max     gt  gt  gt  df groupby  A   agg   B    my min  my max            B          my min my max A               1      0      2 2      3      4

User · Answer

With the inspiration of  Joel Ostblom  For those who already have a workable dictionary for merely aggregation  you can use modify the following code for the newer version aggregation  separating aggregation and renaming part  Please be aware of the nested dictionary if there are more than 1 item   def agg translate agg rename input agg dict       agg dict          rename dict          for k  v in input agg dict items            if len v     1              agg dict k    list v values    0              rename dict k    list v keys    0          else              updated index   1             for nested dict k  nested dict v in v items                    modified key   k         str updated index                  agg dict modified key    nested dict v                 rename dict modified key    nested dict k                 updated index    1     return agg dict  rename dict  one dict     column1     foo    sum     column2     mean    mean    std    std    agg  rename   agg translator aa one dict    We get  agg     column1    sum    column2 1    mean    column2 2    std   rename     column1    foo    column2 1    mean    column2 2    std     Please let me know if there is a smarter way to do it  Thanks

[python] Naming returned columns in Pandas aggregate function?

Examples related to python

Examples related to group-by

Examples related to pandas

Examples related to aggregate-functions