[python] Pandas percentage of total with groupby

This is obviously simple, but as a numpy newbe I'm getting stuck.

I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': range(1, 7) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

This returns:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

This question is related to python pandas

The answer is


I think this would do the trick in 1 line:

df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)

I think this needs benchmarking. Using OP's original DataFrame,

df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': range(1, 7) * 2,
    'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})

1st Andy Hayden

As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()

3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)


2nd Paul H

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100

4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)


3rd exp1orer

This is the slowest answer as it calculates x.sum() for each x in level 0.

For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you use method chaining to write this in a single line. We therefore remove the need decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

Here is the modification,

(
    df.groupby(['state', 'office_id'])
    .agg({'sales': 'sum'})
    .groupby(level=0)
    .apply(lambda x: 100 * x / float(x.sum()))
)

10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)


So no one is going care about 6ms on a small dataset. However, this is 3x speed up and, on a larger dataset with high cardinality groupbys this is going to make a massive difference.

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

import string

import numpy as np
import pandas as pd
np.random.seed(0)

groups = [
    ''.join(i) for i in zip(
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
                       )
]

df = pd.DataFrame({'state': groups * 400,
               'office_id': list(range(1, 601)) * 20000,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)] * 1000000
})

Using Andy's,

2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

and exp1orer

19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

So now we see x10 speed up on large, high cardinality datasets.


Be sure to UV these three answers if you UV this one!!


You can sum the whole DataFrame and divide by the state total:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

Returns

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

TypeError: unsupported operand type(s) for /: 'str' and 'str'


As someone who is also learning pandas I found the other answers a bit implicit as pandas hides most of the work behind the scenes. Namely in how the operation works by automatically matching up column and index names. This code should be equivalent to a step by step version of @exp1orer's accepted answer

With the df, I'll call it by the alias state_office_sales:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

state_total_sales is state_office_sales grouped by total sums in index level 0 (leftmost).

In:   state_total_sales = df.groupby(level=0).sum()
      state_total_sales

Out: 
       sales
state   
AZ     2448009
CA     2832270
CO     1495486
WA     595859

Because the two dataframes share an index-name and a column-name pandas will find the appropriate locations through shared indexes like:

In:   state_office_sales / state_total_sales

Out:  

                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          0.288022
        3          0.322169
        5          0.389809
CO      1          0.206684
        3          0.357891
        5          0.435425
WA      2          0.321689
        4          0.346325
        6          0.331986

To illustrate this even better, here is a partial total with a XX that has no equivalent. Pandas will match the location based on index and column names, where there is no overlap pandas will ignore it:

In:   partial_total = pd.DataFrame(
                      data   =  {'sales' : [2448009, 595859, 99999]},
                      index  =             ['AZ',    'WA',   'XX' ]
                      )
      partial_total.index.name = 'state'


Out:  
         sales
state
AZ       2448009
WA       595859
XX       99999
In:   state_office_sales / partial_total

Out: 
                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          NaN
        3          NaN
        5          NaN
CO      1          NaN
        3          NaN
        5          NaN
WA      2          0.321689
        4          0.346325
        6          0.331986

This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has a no index-name.

In:   missing_index_totals = state_total_sales.rename_axis("")
      missing_index_totals

Out:  
       sales
AZ     2448009
CA     2832270
CO     1495486
WA     595859
In:   state_office_sales / missing_index_totals 

Out:  ValueError: cannot join with no overlapping index names

I realize there are already good answers here.

I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.

It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).

The following snippet fulfills these criteria:

df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())

Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).


Simple way I have used is a merge after the 2 groupby's then doing simple division.

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)]})

grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()

Returns:

sales
state   office_id   
AZ  2   54.587910
    4   33.009225
    6   12.402865
CA  1   32.046582
    3   44.937684
    5   23.015735
CO  1   21.099989
    3   31.848658
    5   47.051353
WA  2   43.882790
    4   10.265275
    6   45.851935

(This solution is inspired from this article https://pbpython.com/pandas_transform.html)

I find the following solution to be the simplest(and probably the fastest) using transformation:

Transformation: While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.

So using transformation, the solution is 1-liner:

df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

And if you print:

print(df.sort_values(['state', 'office_id']).reset_index(drop=True))

   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509

For conciseness I'd use the SeriesGroupBy:

In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

In [12]: c
Out[12]:
state  office_id
AZ     2            925105
       4            592852
       6            362198
CA     1            819164
       3            743055
       5            292885
CO     1            525994
       3            338378
       5            490335
WA     2            623380
       4            441560
       6            451428
Name: count, dtype: int64

In [13]: c / c.groupby(level=0).sum()
Out[13]:
state  office_id
AZ     2            0.492037
       4            0.315321
       6            0.192643
CA     1            0.441573
       3            0.400546
       5            0.157881
CO     1            0.388271
       3            0.249779
       5            0.361949
WA     2            0.411101
       4            0.291196
       6            0.297703
Name: count, dtype: float64

For multiple groups you have to use transform (using Radical's df):

In [21]: c =  df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")

In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1  Group 2  Final Group
AAHQ     BOSC     OWON           0.331006
                  TLAM           0.668994
         MQVF     BWSI           0.288961
                  FXZM           0.711039
         ODWV     NFCH           0.262395
...
Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).


You need to make a second groupby object that groups by the states, and then use the div method:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

the level='state' kwarg in div tells pandas to broadcast/join the dataframes base on the values in the state level of the index.


The most elegant way to find percentages across columns or index is to use pd.crosstab.

Sample Data

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

The output dataframe is like this

print(df)

        state   office_id   sales
    0   CA  1   764505
    1   WA  2   313980
    2   CO  3   558645
    3   AZ  4   883433
    4   CA  5   301244
    5   WA  6   752009
    6   CO  1   457208
    7   AZ  2   259657
    8   CA  3   584471
    9   WA  4   122358
    10  CO  5   721845
    11  AZ  6   136928

Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.

result = pd.crosstab(index=df['state'], 
                     columns=df['office_id'], 
                     values=df['sales'], 
                     aggfunc='sum', 
                     normalize='index').applymap('{:.2f}%'.format)




print(result)
office_id   1   2   3   4   5   6
state                       
AZ  0.00%   0.20%   0.00%   0.69%   0.00%   0.11%
CA  0.46%   0.00%   0.35%   0.00%   0.18%   0.00%
CO  0.26%   0.00%   0.32%   0.00%   0.42%   0.00%
WA  0.00%   0.26%   0.00%   0.10%   0.00%   0.63%

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:

Create the test dataframe with 50,000 unique groups

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]

Array method of finding percentage:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about ~0.15 seconds

Top answer method (using lambda function):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about ~21 seconds to produce the same result.

The result:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494

One-line solution:

df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')

This returns a Series of per-office ratios -- can be used on it's own or assigned to the original Dataframe.