As of pandas version 0.22, there is also an alternative to apply: pipe, which can be considerably faster than apply (you can also check this question for more differences between the two functionalities).
For your example:
df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})
  my_label
0        A
1        B
2        A
3        C
4        D
5        D
6        E
The apply version
df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
gives
          my_label
my_label
A         0.285714
B         0.142857
C         0.142857
D         0.285714
E         0.142857
and the pipe version
df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
yields
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
So the values are identical, but the timings differ quite a lot (at least for this small dataframe):
%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop
and
%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 µs per loop
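The %timeit lines above only work inside IPython or Jupyter. A minimal sketch for repeating the comparison in a plain Python script, using the standard timeit module, could look like this. The function names and n_runs are made up for illustration, and the apply variant selects the column before calling apply, which is a small deviation from the one-liner above intended to keep it working across pandas versions. Absolute numbers depend on your machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"my_label": ['A', 'B', 'A', 'C', 'D', 'D', 'E']})


def share_with_apply():
    # Hypothetical helper: the lambda runs once per group on a Series.
    return df.groupby('my_label')['my_label'].apply(lambda s: s.count() / len(df))


def share_with_pipe():
    # Hypothetical helper: the lambda runs once on the whole GroupBy object.
    return df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())


n_runs = 200  # arbitrary repeat count, chosen only for illustration
apply_seconds = timeit.timeit(share_with_apply, number=n_runs)
pipe_seconds = timeit.timeit(share_with_pipe, number=n_runs)

print(f"apply: {apply_seconds / n_runs * 1e3:.3f} ms per call")
print(f"pipe:  {pipe_seconds / n_runs * 1e3:.3f} ms per call")
```

The difference comes from where the Python-level function is called: apply invokes it for every group, while pipe hands over the GroupBy object once and leaves the work to size().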
Wrapping it into a function is then also straightforward:
def get_perc(grp_obj):
    gr_size = grp_obj.size()
    return gr_size / gr_size.sum()
Now you can call
df.groupby('my_label').pipe(get_perc)
yielding
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
However, for this particular case, you do not even need a groupby; you can just use value_counts like this:
df['my_label'].value_counts(sort=False) / df.shape[0]
yielding
A 0.285714
C 0.142857
B 0.142857
E 0.142857
D 0.285714
Name: my_label, dtype: float64
For this small dataframe it is quite fast:
%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 µs per loop
As pointed out by @anmol, the last statement can also be simplified to:
df['my_label'].value_counts(sort=False, normalize=True)
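One caveat: the two value_counts variants agree for the example data, but they can differ once the column contains missing values. With the default dropna=True, value_counts leaves missing labels out, so normalize=True divides by the number of non-missing rows, whereas dividing by the row count (shape[0]) still includes them. A small sketch with made-up data (the labels Series and both variable names are hypothetical, not part of the question):

```python
import pandas as pd

# Made-up data with one missing label, for illustration only.
labels = pd.Series(['A', 'A', None, 'B'], name='my_label')

# Denominator is the total number of rows, including the missing one.
share_of_all_rows = labels.value_counts(sort=False) / labels.shape[0]

# Denominator is the number of non-missing labels only.
share_of_labelled = labels.value_counts(sort=False, normalize=True)

print(share_of_all_rows.sum())   # less than 1 because the missing row is counted
print(share_of_labelled.sum())   # sums to 1
```

Which one you want depends on whether rows without a label should count towards the total.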