[python] Apply function to pandas groupby

I have a pandas DataFrame with a column called my_labels which contains the strings 'A', 'B', 'C', 'D', 'E'. I would like to count the number of occurrences of each of these strings, then divide each count by the sum of all the counts. I'm trying to do this in pandas like this:

func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)

This code throws the error 'DataFrame' object has no attribute 'size'. How can I apply a function to calculate this in pandas?


The answer is:


Try:

import pandas as pd

g = pd.DataFrame(['A','B','A','C','D','D','E'])

# Group by the contents of column 0
gg = g.groupby(0)

# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())

# Add a new column that is the count / total number of elements
histo[1] = histo[0].astype(float) / len(g)

print(histo)

Output:

   0         1
0             
A  2  0.285714
B  1  0.142857
C  1  0.142857
D  2  0.285714
E  1  0.142857

I once saw a nested-function technique for computing a weighted average on S.O.; adapting that technique can solve your issue.

import pandas as pd

def group_weight(overall_size):
    def inner(group):
        return len(group) / float(overall_size)
    # Naming the inner function labels the result when used with agg
    inner.__name__ = 'weight'
    return inner

d = {"my_label": pd.Series(['A','B','A','C','D','D','E'])}
df = pd.DataFrame(d)
print(df.groupby('my_label').apply(group_weight(len(df))))



my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857
dtype: float64

Here is how to do a weighted average within groups:

def wavg(val_col_name, wt_col_name):
    def inner(group):
        # Weighted average: sum of value*weight divided by sum of weights
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wgt_avg'
    return inner

d = {"P": pd.Series(['A','B','A','C','D','D','E']),
     "Q": pd.Series([1, 2, 3, 4, 5, 6, 7]),
     "R": pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])}

df = pd.DataFrame(d)
print(df.groupby('P').apply(wavg('Q', 'R')))

P
A    2.500000
B    2.000000
C    4.000000
D    5.545455
E    7.000000
dtype: float64
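As a sanity check, for group D the weighted average is (5 * 0.5 + 6 * 0.6) / (0.5 + 0.6) = 6.1 / 1.1 ≈ 5.545455, matching the output above.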

Regarding the issue with 'size': size is not a method on a DataFrame, it is a property. So instead of calling size(), plain size should work.
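To illustrate the difference (a minimal sketch with made-up sample data): on a DataFrame, size is a property, while on a groupby object, size() is a method:

import pandas as pd

df = pd.DataFrame({'my_labels': ['A','B','A','C','D','D','E']})

print(df.size)                         # property: total number of cells (7)
print(df.groupby('my_labels').size())  # method: row count per group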

Apart from that, a method like this should work

def doCalculation(df):
    # Total number of cells in the group
    groupCount = df.size
    # Number of non-null values in the my_labels column
    groupSum = df['my_labels'].notnull().sum()

    return groupCount / groupSum

frame.groupby('my_labels').apply(doCalculation)

As of Pandas version 0.22, there is also an alternative to apply: pipe, which can be considerably faster than apply (you can also check this question for more differences between the two).

For your example:

df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})

  my_label
0        A
1        B
2        A
3        C
4        D
5        D
6        E

The apply version

df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])

gives

          my_label
my_label          
A         0.285714
B         0.142857
C         0.142857
D         0.285714
E         0.142857

and the pipe version

df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())

yields

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857

So the values are identical; however, the timings differ quite a lot (at least for this small dataframe):

%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop

and

%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 µs per loop

Wrapping it into a function is then also straightforward:

def get_perc(grp_obj):
    gr_size = grp_obj.size()
    return gr_size / gr_size.sum()

Now you can call

df.groupby('my_label').pipe(get_perc)

yielding

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857

However, for this particular case, you do not even need a groupby, but you can just use value_counts like this:

df['my_label'].value_counts(sort=False) / df.shape[0]

yielding

A    0.285714
C    0.142857
B    0.142857
E    0.142857
D    0.285714
Name: my_label, dtype: float64

For this small dataframe it is quite fast:

%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 µs per loop

As pointed out by @anmol, the last statement can also be simplified to

df['my_label'].value_counts(sort=False, normalize=True)
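
With normalize=True, value_counts does the division by the total count for you. A quick check (a minimal sketch, rebuilding the sample df from above) that the two expressions agree:

import pandas as pd

df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})

# Manual normalization vs. the built-in normalize flag
manual = df['my_label'].value_counts(sort=False) / df.shape[0]
direct = df['my_label'].value_counts(sort=False, normalize=True)

# The values agree elementwise (the Series name may differ in newer pandas versions)
assert (manual == direct).all()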