[python] Pandas - How to flatten a hierarchical index in columns

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

I want to flatten it, so that it looks like this (names aren't critical - I could rename):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tempf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

How do I do this? (I've tried a lot, to no avail.)

Per a suggestion, here is the head in dict form

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}
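
If you want to reproduce this, the dict above can be passed straight back to the DataFrame constructor; tuple keys become a MultiIndex on the columns. A truncated sketch:

import pandas as pd

# a truncated version of the dict above, just to get the same column structure
head = {('USAF', ''): {0: '702730', 1: '702730'},
        ('s_PC', 'sum'): {0: 1.0, 1: 0.0},
        ('tempf', 'amax'): {0: 30.92, 1: 32.0},
        ('tempf', 'amin'): {0: 24.98, 1: 24.98}}
df = pd.DataFrame(head)
print(df.columns.nlevels)  # 2 -- a two-level MultiIndex on the columns, as in the question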

This question is related to: python, pandas, dataframe

The answers are below.


Another simple routine.

def flatten_columns(df, sep='.'):
    # Join each column's non-empty levels with `sep`; modifies df in place.
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)

    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns
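
For example, on the frame from the question (a sketch; note that flatten_columns modifies df in place and returns nothing):

flatten_columns(df)
print(df.columns)
# Index(['USAF', 'WBAN', 'year', 'month', 'day', 's_PC.sum', 's_CL.sum',
#        's_CD.sum', 's_CNT.sum', 'tempf.amax', 'tempf.amin'],
#       dtype='object')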

A bit late maybe, but if you are not worried about duplicate column names:

df.columns = df.columns.tolist()

Following @jxstanford and @tvt173, I wrote a quick function which should do the trick, regardless of string/int column names:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_') 
        for t in df.columns.values
        ]
    return df
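
A quick check that integer column labels survive (a made-up frame, not the one from the question):

import pandas as pd

g = pd.DataFrame({"k": ["a", "a", "b"], 1: [10, 20, 30]})
g = g.groupby("k").agg({1: ["sum", "max"]})   # columns: MultiIndex [(1, 'sum'), (1, 'max')]
print(flatten_cols(g).columns.tolist())
# ['1_sum', '1_max']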

You could also do as below. Consider df to be your dataframe, and assume a two-level column index (as in your example):

df.columns = [df.columns[i][0] + '_' + df.columns[i][1] for i in range(len(df.columns))]

A general solution that handles multiple levels and mixed types:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]
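
On the frame from the question this gives the following; note the trailing separator where the second level is empty (add .rstrip('_') if you want to drop it):

print(df.columns)
# Index(['USAF_', 'WBAN_', 'year_', 'month_', 'day_', 's_PC_sum', 's_CL_sum',
#        's_CD_sum', 's_CNT_sum', 'tempf_amax', 'tempf_amin'],
#       dtype='object')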

The easiest and most intuitive solution for me was to combine the column names using get_level_values. This prevents duplicate column names when you do more than one aggregation on the same column:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

If you want a separator between the levels, you can do this. It returns the same thing as Seiji Armstrong's comment on the accepted answer, adding an underscore only for columns that have values in both index levels:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two
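
Applied to the question's frame, this gives columns like:

print(df.columns)
# Index(['USAF', 'WBAN', 'year', 'month', 'day', 's_PC_sum', 's_CL_sum',
#        's_CD_sum', 's_CNT_sum', 'tempf_amax', 'tempf_amin'],
#       dtype='object')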

I know this does the same thing as Andy Hayden's great answer above, but I think it is a bit more intuitive this way and is easier to remember (so I don't have to keep referring to this thread), especially for novice pandas users.

This method also extends naturally to the case where you have three column levels.

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three
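
If you also want separators with three (or more) levels, the same pattern generalizes; here is a sketch that skips empty levels so that plain columns keep their names unchanged:

# collect every level as strings, then join the non-empty parts per column
parts = [df.columns.get_level_values(i).astype(str) for i in range(df.columns.nlevels)]
df.columns = ['_'.join(p for p in levels if p) for levels in zip(*parts)]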

I'll share a straightforward way that worked for me.

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed

The most pythonic way to do this is to use the map function.

df.columns = df.columns.map(' '.join).str.strip()

Output of print(df.columns):

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

Update for Python 3.6+, using an f-string:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

Output:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only

All of the earlier answers on this thread are a bit dated. As of pandas version 0.24.0, .to_flat_index() does what you need.

From pandas' own documentation:

MultiIndex.to_flat_index()

Convert a MultiIndex to an Index of Tuples containing the level values.

A simple example from its documentation:

import pandas as pd
print(pd.__version__)  # to_flat_index() needs pandas 0.24.0 or later
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

Applying to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

Using it to replace the existing pandas columns

An example of how you'd use it on dat, which is a DataFrame with a MultiIndex column:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')
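
If you would rather have plain strings than tuples, the flattened tuples can be joined afterwards, for example:

# join the tuples left over from to_flat_index() into single strings
dat.columns = ["_".join(col) for col in dat.columns]
print(dat.columns)
# Index(['class_size_count', 'class_size_mean', 'class_size_std',
#        'class_size_min', 'class_size_25%', 'class_size_50%',
#        'class_size_75%', 'class_size_max'],
#       dtype='object')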

To flatten a MultiIndex inside a chain of other DataFrame methods, define a function like this:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

Then use the pipe method to apply this function in the chain of DataFrame methods, after groupby and agg but before any other methods in the chain:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')
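
A minimal, self-contained sketch of the chain above (my_df here is made up for illustration):

import pandas as pd

my_df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

out = my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')

print(out.columns.tolist())  # ['group', 'value_count']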

In case you want to have a separator in the name between levels, this function works well.

def flattenHierarchicalCol(col, sep='_'):
    if not isinstance(col, tuple):
        return col
    # join the non-empty levels with the separator
    new_col = ''
    for leveli, level in enumerate(col):
        if level != '':
            if leveli != 0:
                new_col += sep
            new_col += level
    return new_col

df.columns = df.columns.map(flattenHierarchicalCol)
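
To pass a different separator, wrap the call in a lambda (a small variation on the above):

df.columns = df.columns.map(lambda col: flattenHierarchicalCol(col, sep='.'))
# e.g. ('tempf', 'amax') becomes 'tempf.amax'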

After reading through all the answers, I came up with this:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

Usage:

Given a data frame:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • Single aggregation method: resulting variables named the same as source:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    
    • Same as df.groupby(by="grouper", as_index=False) or .agg(...).reset_index()
    • ----- before -----
                 val1  2
        grouper         
        x           0  1
        y           4  5
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
      
  • Single source variable, multiple aggregations: resulting variables named after statistics:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    
    • Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
    • ----- before -----
                  val1    
                 min max
        grouper         
        x          0   2
        y          4   6
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
      
  • Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    
    • Runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in MultiIndex on columns).
    • If you don't have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns)
    • To handle the numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don't understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns)
    • ----- before -----
                val1   2     
                 min sum size
        grouper              
        x          0   4    2
        y          4  12    2
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
      
  • You want to name the resulting variables manually: (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23)

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    
    • Other suggestions include setting the columns manually (res.columns = ['A_sum', 'B_sum', 'count']) or .join()ing multiple groupby statements.
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
        x                   2           2          2        4
        y                   2          10          2       12
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12
      

Cases handled by the helper function

  • level names can be non-string (e.g. when you index a pandas DataFrame by column numbers and the column names are integers), so we have to convert with map(str, ..)
  • they can also be empty, so we have to filter(None, ..)
  • for single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples)
  • depending on how you used .agg() you may need to keep the bottom-most label for a column or concatenate multiple labels
  • (perhaps because I'm new to pandas) more often than not, I want reset_index() so I can work with the group-by columns in the regular way, so the helper does that by default

df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]

Andy Hayden's answer is certainly the easiest way, but if you want to avoid duplicate column labels you need to tweak it a bit:

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993

And if you want to retain any of the aggregation info from the second level of the multiindex you can try this:

In [1]: new_cols = [''.join(t) for t in df.columns]

In [2]: new_cols
Out[2]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [3]: df.columns = new_cols
