How do I create a new column from the output of pandas groupby sum

Question

Trying to create a new column from the groupby calculation  In the code below  I get the correct calculated values for each date  see group below  but when I try to create a new column  df  Data4    with it I get NaN  So I am trying to create a new column in the dataframe with the sum of Data3 for the all dates and apply that to each date row  For example  2015-05-08 is in 2 rows  total is 50 5   55  and in this new column I would like to have 55 in both of the rows    import pandas as pd import numpy as np from pandas import DataFrame  df   pd DataFrame        Date      2015-05-08    2015-05-07    2015-05-06    2015-05-05    2015-05-08    2015-05-07    2015-05-06    2015-05-05          Sym       aapl    aapl    aapl    aapl    aaww    aaww    aaww    aaww          Data2    11  8  10  15  110  60  100  40        Data3    5  8  6  1  50  100  60  120      group   df  Data3   groupby df  Date    sum    df  Data4     group

User · Answer

How do I create a new column with Groupby().Sum()?

There are two ways - one straightforward and the other slightly more interesting.

Everybody's Favorite: `GroupBy.transform()` with `'sum'`

@Ed Chum's answer can be simplified, a bit. Call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.

# The setup.
df[['Date', 'Data3']]

         Date  Data3
0  2015-05-08      5
1  2015-05-07      8
2  2015-05-06      6
3  2015-05-05      1
4  2015-05-08     50
5  2015-05-07    100
6  2015-05-06     60
7  2015-05-05    120

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64

It's a tad faster,

df2 = pd.concat([df] * 12345)

%timeit df2['Data3'].groupby(df['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')

10.4 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Unconventional, but Worth your Consideration: `GroupBy.sum()` + `Series.map()`

I stumbled upon an interesting idiosyncrasy in the API. From what I tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:

df.Date.map(df.groupby('Date')['Data3'].sum())

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Date, dtype: int64

Compare with

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64

My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.

(Left: v0.23, Right: v0.24)

Nice alternative to know, and better if you have smaller frames with smaller numbers of groups. . . but I would recommend transform as a first choice. Thought this was worth sharing anyway.

Benchmarking code, for reference:

import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
    kernels=[
        lambda df: df.groupby('A')['B'].transform('sum'),
        lambda df:  df.A.map(df.groupby('A')['B'].sum()),
    ],
    labels=['GroupBy.transform', 'GroupBy.sum + map'],
    n_range=[2**k for k in range(5, 20)],
    xlabel='N',
    logy=True,
    logx=True
)

User · Answer

You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column   In  74    df   pd DataFrame   Date     2015-05-08    2015-05-07    2015-05-06    2015-05-05    2015-05-08    2015-05-07    2015-05-06    2015-05-05     Sym     aapl    aapl    aapl    aapl    aaww    aaww    aaww    aaww     Data2    11  8  10  15  110  60  100  40   Data3    5  8  6  1  50  100  60  120      df  Data4     df  Data3   groupby df  Date    transform  sum   df Out 74      Data2  Data3        Date   Sym  Data4 0     11      5  2015-05-08  aapl     55 1      8      8  2015-05-07  aapl    108 2     10      6  2015-05-06  aapl     66 3     15      1  2015-05-05  aapl    121 4    110     50  2015-05-08  aaww     55 5     60    100  2015-05-07  aaww    108 6    100     60  2015-05-06  aaww     66 7     40    120  2015-05-05  aaww    121

[python] How do I create a new column from the output of pandas groupby().sum()?

How do I create a new column with Groupby().Sum()?

Everybody's Favorite: `GroupBy.transform()` with `'sum'`

Unconventional, but Worth your Consideration: `GroupBy.sum()` + `Series.map()`

Examples related to python

Examples related to pandas

Examples related to group-by

Examples related to pandas-groupby

[python] How do I create a new column from the output of pandas groupby().sum()?

How do I create a new column with Groupby().Sum()?

Everybody's Favorite: GroupBy.transform() with 'sum'

Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map()

Examples related to python

Examples related to pandas

Examples related to group-by

Examples related to pandas-groupby

Everybody's Favorite: `GroupBy.transform()` with `'sum'`

Unconventional, but Worth your Consideration: `GroupBy.sum()` + `Series.map()`