Pandas percentage of total with groupby

Question

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (the total of all percentages in each state is 100%).

    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})

    df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

This returns:

                      sales
    state office_id
    AZ    2          839507
          4          373917
          6          347225
    CA    1          798585
          3          890850
          5          454423
    CO    1          819975
          3          202969
          5          614011
    WA    2          163942
          4          369858
          6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

User · Answer

I think this needs benchmarking. Using OP's original DataFrame,

    df = pd.DataFrame({
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    })

1st Andy Hayden

As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

    c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    c / c.groupby(level=0).sum()

3.42 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2nd Paul H

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100

4.66 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

3rd exp1orer

This is the slowest answer, as it calculates x.sum() for each x in level 0.

For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

Here is the modification:

    (
        df.groupby(['state', 'office_id'])
        .agg({'sales': 'sum'})
        .groupby(level=0)
        .apply(lambda x: 100 * x / float(x.sum()))
    )

10.6 ms ± 81.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up and, on a larger dataset with high-cardinality groupbys, this is going to make a massive difference.

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids:

    import string

    import numpy as np
    import pandas as pd
    np.random.seed(0)

    groups = [
        ''.join(i) for i in zip(
            np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
            np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
            np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
        )
    ]

    df = pd.DataFrame({'state': groups * 400,
                       'office_id': list(range(1, 601)) * 20000,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)] * 1000000
    })

Using Andy's:

2 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and exp1orer's:

19 s ± 77.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So now we see a 10x speed-up on large, high-cardinality datasets.

Be sure to upvote these three answers if you upvote this one!
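If you want to rerun a comparison like this on your own machine, a minimal sketch could look like the following. The data sizes, the seed and the two function names are my own choices for illustration, and I compare two transform-based variants (a built-in grouped sum versus a per-group Python lambda) rather than the exact three answers timed above, so expect different absolute numbers.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; the sizes and seed are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
n_rows = 50_000
df = pd.DataFrame({
    "state": rng.integers(0, 2_000, n_rows),
    "office_id": rng.integers(0, 20, n_rows),
    "sales": rng.integers(100_000, 999_999, n_rows),
})

c = df.groupby(["state", "office_id"])["sales"].sum()


def vectorised():
    # One grouped sum broadcast back to every row, then a single division.
    return 100 * c / c.groupby(level=0).transform("sum")


def per_group_lambda():
    # The Python lambda is called once per state.
    return c.groupby(level=0).transform(lambda x: 100 * x / x.sum())


for func in (vectorised, per_group_lambda):
    seconds = timeit.timeit(func, number=3) / 3
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

Both functions return the same percentages; only the time per call should differ, and the gap tends to grow with the number of distinct states.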

User · Answer

You can sum the whole DataFrame and divide by the state total:

    # Copying setup from Paul H answer
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    # Add a column with the sales divided by state total sales.
    df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

    df

Returns:

        office_id   sales state  sales_ratio
    0           1  405711    CA     0.193319
    1           2  535829    WA     0.347072
    2           3  217952    CO     0.198743
    3           4  252315    AZ     0.192500
    4           5  982371    CA     0.468094
    5           6  459783    WA     0.297815
    6           1  404137    CO     0.368519
    7           2  222579    AZ     0.169814
    8           3  710581    CA     0.338587
    9           4  548242    WA     0.355113
    10          5  474564    CO     0.432739
    11          6  835831    AZ     0.637686

But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:

    df.office_id = df.office_id.astype(str)
    df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

    TypeError: unsupported operand type(s) for /: 'str' and 'str'
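One way around that limitation, as a hedged sketch: select the sales column before dividing, so the string column never takes part in the arithmetic. The tiny frame below uses made-up values (not the random data above) so the result is easy to check by hand.

```python
import pandas as pd

# Small hand-made frame; the values are illustrative only.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": ["a", "b", "a", "b"],  # strings on purpose
    "sales": [100, 300, 50, 150],
})

# Per-row state total, computed on the numeric column only.
state_total = df.groupby("state")["sales"].transform("sum")
df["sales_ratio"] = df["sales"] / state_total

print(df)
```

The office_id column stays untouched, and each state's ratios add up to 1.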

User · Answer

Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way: just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:

    # From Paul H
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})
    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    # Change: groupby state_office and divide by sum
    state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                     100 * x / float(x.sum()))

Returns:

                         sales
    state office_id
    AZ    2          16.981365
          4          19.250033
          6          63.768601
    CA    1          19.331879
          3          33.858747
          5          46.809373
    CO    1          36.851857
          3          19.874290
          5          43.273852
    WA    2          34.707233
          4          35.511259
          6          29.781508

User · Answer

One-line solution:

    df.join(
        df.groupby('state').agg(state_total=('sales', 'sum')),
        on='state'
    ).eval('sales / state_total')

This returns a Series of per-office ratios, which can be used on its own or assigned to the original DataFrame.
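For the "assigned to the original DataFrame" case, a small sketch on made-up numbers might look like this. The column name sales_ratio is my own choice, and the named-aggregation syntax assumes a reasonably recent pandas.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA", "CA"],
    "office_id": [2, 4, 1, 3],
    "sales": [100, 300, 50, 150],
})

# Per-state totals, joined back onto every row via the state column.
state_totals = df.groupby("state").agg(state_total=("sales", "sum"))
ratios = df.join(state_totals, on="state").eval("sales / state_total")

# The Series shares df's index, so it can be assigned directly.
df["sales_ratio"] = ratios
print(df)
```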

User · Answer

    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})

    grouped = df.groupby(['state', 'office_id'])
    100 * grouped.sum() / df[["state", "sales"]].groupby('state').sum()

Returns:

                         sales
    state office_id
    AZ    2          54.587910
          4          33.009225
          6          12.402865
    CA    1          32.046582
          3          44.937684
          5          23.015735
    CO    1          21.099989
          3          31.848658
          5          47.051353
    WA    2          43.882790
          4          10.265275
          6          45.851935

User · Answer

I realize there are already good answers here.

I nevertheless would like to contribute my own, because I feel that for an elementary, simple question like this, there should be a short solution that is understandable at a glance.

It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).

The following snippet fulfills these criteria:

    df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())

Note that if you're still using Python 2, you'll have to replace the x in the denominator of the lambda term by float(x).
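To illustrate the multi-level case mentioned above, here is a minimal sketch. The country column and all the numbers are hypothetical, added only to show that the same line works with two grouping keys.

```python
import pandas as pd

# Hypothetical data with an extra "country" level; values are made up.
df = pd.DataFrame({
    "country": ["US", "US", "US", "US", "CA", "CA"],
    "state": ["CA", "CA", "WA", "WA", "ON", "ON"],
    "office_id": [1, 2, 1, 2, 1, 2],
    "sales": [10, 30, 20, 20, 5, 15],
})

# Same pattern as the one-liner above, with two grouping keys instead of one.
df["sales_ratio"] = (
    df.groupby(["country", "state"])["sales"].transform(lambda x: x / x.sum())
)

print(df)
```

Within each (country, state) pair the ratios sum to 1, and the other columns are left as they were.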

User · Answer

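I think this would do the trick in 1 line:

    df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)

Note that transform is called here on the summed frame as a whole, so each value is divided by the total over all states. The result is a percentage of overall sales, not of each state's sales.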
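With a one-liner like this it is worth checking what the denominator actually is. The sketch below, on made-up numbers, puts a plain transform on the summed frame next to a transform that is grouped on the state level first; the variable names are my own.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA", "CA"],
    "office_id": [2, 4, 1, 3],
    "sales": [100, 300, 50, 150],
})

totals = df.groupby(["state", "office_id"]).sum()

# Plain transform on the frame: divides by the sum over ALL rows.
pct_of_all = totals.transform(lambda x: x / x.sum() * 100)

# Grouping on the first index level first: divides by each state's sum.
pct_of_state = totals.groupby(level=0).transform(lambda x: x / x.sum() * 100)

print(pct_of_all)
print(pct_of_state)
```

In the first result the whole column adds up to 100; in the second, each state adds up to 100, which is what the question asks for.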

User · Answer

This solution is inspired by this article: https://pbpython.com/pandas_transform.html

I find the following solution to be the simplest (and probably the fastest), using transformation:

    Transformation: While aggregation must return a reduced version of the
    data, transformation can return some transformed version of the full
    data to recombine. For such a transformation, the output is the same
    shape as the input.

So using transformation, the solution is a 1-liner:

    df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

And if you print:

    print(df.sort_values(['state', 'office_id']).reset_index(drop=True))

       state  office_id   sales          %
    0     AZ          2  195197   9.844309
    1     AZ          4  877890  44.274352
    2     AZ          6  909754  45.881339
    3     CA          1  614752  50.415708
    4     CA          3  395340  32.421767
    5     CA          5  209274  17.162525
    6     CO          1  549430  42.659629
    7     CO          3  457514  35.522956
    8     CO          5  280995  21.817415
    9     WA          2  828238  35.696929
    10    WA          4  719366  31.004563
    11    WA          6  772590  33.298509

User · Answer

The most elegant way to find percentages across columns or index is to use pd.crosstab.

Sample Data

    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

The output dataframe is like this:

    print(df)

       state  office_id   sales
    0     CA          1  764505
    1     WA          2  313980
    2     CO          3  558645
    3     AZ          4  883433
    4     CA          5  301244
    5     WA          6  752009
    6     CO          1  457208
    7     AZ          2  259657
    8     CA          3  584471
    9     WA          4  122358
    10    CO          5  721845
    11    AZ          6  136928

Just specify the index, columns and the values to aggregate. The normalize keyword will calculate % across index or columns depending upon the context.

    result = pd.crosstab(index=df['state'],
                         columns=df['office_id'],
                         values=df['sales'],
                         aggfunc='sum',
                         normalize='index').applymap('{:.2f}'.format)

    print(result)

    office_id     1     2     3     4     5     6
    state
    AZ         0.00  0.20  0.00  0.69  0.00  0.11
    CA         0.46  0.00  0.35  0.00  0.18  0.00
    CO         0.26  0.00  0.32  0.00  0.42  0.00
    WA         0.00  0.26  0.00  0.10  0.00  0.63
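A smaller sketch of the same idea, on made-up numbers so the shares can be checked by hand. I keep the result numeric and use round for display instead of formatting to strings, which is my own choice rather than part of the answer above.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA", "CA"],
    "office_id": [2, 4, 1, 3],
    "sales": [100, 300, 50, 150],
})

# normalize="index" divides every row (state) by that row's total.
shares = pd.crosstab(index=df["state"],
                     columns=df["office_id"],
                     values=df["sales"],
                     aggfunc="sum",
                     normalize="index")

print(shares.round(2))
```

Offices that do not exist in a state show up as 0, and every row sums to 1.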

User · Answer

As someone who is also learning pandas, I found the other answers a bit implicit, as pandas hides most of the work behind the scenes, namely in how the operation works by automatically matching up column and index names. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.

With the df, I'll call it by the alias state_office_sales:

                      sales
    state office_id
    AZ    2          839507
          4          373917
          6          347225
    CA    1          798585
          3          890850
          5          454423
    CO    1          819975
          3          202969
          5          614011
    WA    2          163942
          4          369858
          6          959285

state_total_sales is state_office_sales grouped by total sums in index level 0 (leftmost).

    In:   state_total_sales = df.groupby(level=0).sum()
          state_total_sales

    Out:
           sales
    state
    AZ     2448009
    CA     2832270
    CO     1495486
    WA     595859

Because the two dataframes share an index name and a column name, pandas will find the appropriate locations through shared indexes like:

    In:   state_office_sales / state_total_sales

    Out:
                        sales
    state   office_id
    AZ      2          0.448640
            4          0.125865
            6          0.425496
    CA      1          0.288022
            3          0.322169
            5          0.389809
    CO      1          0.206684
            3          0.357891
            5          0.435425
    WA      2          0.321689
            4          0.346325
            6          0.331986

To illustrate this even better, here is a partial total with an XX that has no equivalent. Pandas will match the location based on index and column names; where there is no overlap, pandas will ignore it:

    In:   partial_total = pd.DataFrame(
                              data  = {'sales': [2448009, 595859, 99999]},
                              index =           ['AZ',    'WA',   'XX']
                          )
          partial_total.index.name = 'state'

    Out:
             sales
    state
    AZ       2448009
    WA       595859
    XX       99999

    In:   state_office_sales / partial_total

    Out:
                        sales
    state   office_id
    AZ      2          0.448640
            4          0.125865
            6          0.425496
    CA      1          NaN
            3          NaN
            5          NaN
    CO      1          NaN
            3          NaN
            5          NaN
    WA      2          0.321689
            4          0.346325
            6          0.331986

This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.

    In:   missing_index_totals = state_total_sales.rename_axis("")
          missing_index_totals

    Out:
           sales
    AZ     2448009
    CA     2832270
    CO     1495486
    WA     595859

    In:   state_office_sales / missing_index_totals

    Out:  ValueError: cannot join with no overlapping index names

User · Answer

A simple way I have used is a merge after the 2 groupbys, then doing simple division.

    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

    state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
    state = df.groupby(['state'])['sales'].sum().reset_index()
    state_office = state_office.merge(state, left_on='state', right_on='state', how='left')
    state_office['sales_ratio'] = 100 * (state_office['sales_x'] / state_office['sales_y'])

       state  office_id  sales_x  sales_y  sales_ratio
    0     AZ          2   222579  1310725    16.981365
    1     AZ          4   252315  1310725    19.250033
    2     AZ          6   835831  1310725    63.768601
    3     CA          1   405711  2098663    19.331879
    4     CA          3   710581  2098663    33.858747
    5     CA          5   982371  2098663    46.809373
    6     CO          1   404137  1096653    36.851857
    7     CO          3   217952  1096653    19.874290
    8     CO          5   474564  1096653    43.273852
    9     WA          2   535829  1543854    34.707233
    10    WA          4   548242  1543854    35.511259
    11    WA          6   459783  1543854    29.781508

User · Answer

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation, so now it's super fast! Below is the example code.

Create the test dataframe with 50,000 unique groups:

    import random
    import string
    import pandas as pd
    import numpy as np
    np.random.seed(0)

    # This is the total number of groups to be created
    NumberOfGroups = 50000

    # Create a lot of groups (random strings of 4 letters)
    Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 10)] * 10
    Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 2)] * 2
    FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

    # Make the numbers
    NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

    # Make the dataframe
    df = pd.DataFrame({'Group 1': Group1,
                       'Group 2': Group2,
                       'Final Group': FinalGroup,
                       'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                                 Numbers I want as percents
    Group 1 Group 2 Final Group
    AAAH    AQYR    RMCH                                847
                    XDCL                                182
            DQGO    ALVF                                132
                    AVPH                                894
            OVGH    NVOO                                650
                    VKQP                                857
            VNLY    HYFW                                884
                    MOYH                                469
            XOOC    GIDS                                168
                    HTOY                                544
    AACE    HNXU    RAXK                                243
                    YZNK                                750
            NOYI    NYGC                                399
                    ZYCI                                614
            QKGK    CRLF                                520
                    UXNA                                970
            TXAR    MLNB                                356
                    NMFJ                                904
            VQYG    NPON                                504
                    QPKQ                                948
    ...
    [50000 rows x 1 columns]

Array method of finding percentage:

    # Initial grouping (basically a sorted version of df)
    PreGroupby_df = df.groupby(["Group 1", "Group 2", "Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
    # Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
    SumGroup_df = df.groupby(["Group 1", "Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
    # Merge the two dataframes
    Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
    # Divide the two columns
    Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
    # Drop the extra _Sum column
    Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about ~0.15 seconds.

Top answer method (using lambda function):

    state_office = df.groupby(['Group 1', 'Group 2', 'Final Group']).agg({'Numbers I want as percents': 'sum'})
    state_pcts = state_office.groupby(level=['Group 1', 'Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about ~21 seconds to produce the same result.

The result:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
    0    AAAH    AQYR        RMCH                         847               82.312925
    1    AAAH    AQYR        XDCL                         182               17.687075
    2    AAAH    DQGO        ALVF                         132               12.865497
    3    AAAH    DQGO        AVPH                         894               87.134503
    4    AAAH    OVGH        NVOO                         650               43.132050
    5    AAAH    OVGH        VKQP                         857               56.867950
    6    AAAH    VNLY        HYFW                         884               65.336290
    7    AAAH    VNLY        MOYH                         469               34.663710
    8    AAAH    XOOC        GIDS                         168               23.595506
    9    AAAH    XOOC        HTOY                         544               76.404494

User · Answer

You need to make a second groupby object that groups by the states, and then use the div method:

    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100

                         sales
    state office_id
    AZ    2          16.981365
          4          19.250033
          6          63.768601
    CA    1          19.331879
          3          33.858747
          5          46.809373
    CO    1          36.851857
          3          19.874290
          5          43.273852
    WA    2          34.707233
          4          35.511259
          6          29.781508

The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
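To see the broadcasting on numbers you can verify by hand, here is a minimal sketch of the same div(..., level='state') idea. I use Series instead of one-column DataFrames and made-up sales figures; both are my own simplifications.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA", "CA"],
    "office_id": [2, 4, 1, 3],
    "sales": [100, 300, 50, 150],
})

# Sales per (state, office) and sales per state.
state_office = df.groupby(["state", "office_id"])["sales"].sum()
state = df.groupby("state")["sales"].sum()

# Each (state, office) value is divided by the total of its own state.
pct = state_office.div(state, level="state") * 100

print(pct)
```

AZ's total is 400, so its offices come out at 25 and 75 percent; the same holds for CA with a total of 200.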

User · Answer

For conciseness I'd use the SeriesGroupBy:

    In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

    In [12]: c
    Out[12]:
    state  office_id
    AZ     2            925105
           4            592852
           6            362198
    CA     1            819164
           3            743055
           5            292885
    CO     1            525994
           3            338378
           5            490335
    WA     2            623380
           4            441560
           6            451428
    Name: count, dtype: int64

    In [13]: c / c.groupby(level=0).sum()
    Out[13]:
    state  office_id
    AZ     2            0.492037
           4            0.315321
           6            0.192643
    CA     1            0.441573
           3            0.400546
           5            0.157881
    CO     1            0.388271
           3            0.249779
           5            0.361949
    WA     2            0.411101
           4            0.291196
           6            0.297703
    Name: count, dtype: float64

For multiple groups you have to use transform (using Radical's df):

    In [21]: c = df.groupby(["Group 1", "Group 2", "Final Group"])["Numbers I want as percents"].sum().rename("count")

    In [22]: c / c.groupby(level=[0, 1]).transform("sum")
    Out[22]:
    Group 1  Group 2  Final Group
    AAHQ     BOSC     OWON           0.331006
                      TLAM           0.668994
             MQVF     BWSI           0.288961
                      FXZM           0.711039
             ODWV     NFCH           0.262395
    ...
    Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).
