I was similarly confused by .transform vs. .apply, and I found a few answers that shed some light on the issue. This answer, for example, was very helpful.
My takeaway so far is that .transform works on each Series (column) in isolation from the others. This means that in your last two calls:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
you asked .transform to take values from two columns, but it does not actually 'see' both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a series (or group of series) 'made' of scalars which are repeated len(input_column) times.
So the scalar that .transform uses to build the Series is the result of some reduction function applied to an input Series (and to only ONE series/column at a time).
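For the two-column calculation in the question, one possible way around this is to combine the columns first and only then group. The sketch below is a minimal illustration on a small made-up dataframe (the values are assumptions, not the data from the question); it groups the difference Series by column A and lets transform broadcast the per-group mean back to every row.

```python
import pandas as pd

# Small made-up frame with the same column names as in the question.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Combine the two columns before grouping, so transform only ever
# has to deal with a single Series.
diff = df["C"] - df["D"]

# Group that single Series by column A; the group mean is repeated
# for every row of the group.
group_mean_diff = diff.groupby(df["A"]).transform("mean")
print(group_mean_diff)
```

The row-wise difference itself (the first of the two calls) needs no groupby at all, since it does not depend on the groups.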
Consider this example (on your dataframe):
zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)
will yield:
C D
0 0.989 0.128
1 -0.478 0.489
2 0.889 -0.589
3 -0.671 -1.150
4 0.034 -0.285
5 1.149 0.662
6 -1.404 -0.907
7 -0.509 1.653
This is exactly the same as what you would get by using it on only one column at a time:
df.groupby('A')['C'].transform(zscore)
yielding:
0 0.989
1 -0.478
2 0.889
3 -0.671
4 0.034
5 1.149
6 -1.404
7 -0.509
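To check this equivalence yourself, here is a self-contained sketch on a small made-up dataframe (the values are assumptions chosen so the z-scores come out as round numbers). It compares column C of the whole-frame result with the result of transforming column C alone.

```python
import pandas as pd

# Made-up data: each group has three rows.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "D": [10.0, 0.0, 20.0, 5.0, 60.0, 10.0],
})

def zscore(x):
    # Only refers to 'x' itself, so it works whether it is
    # applied to one column or to several.
    return (x - x.mean()) / x.std()

whole = df.groupby("A").transform(zscore)        # all non-key columns
only_c = df.groupby("A")["C"].transform(zscore)  # column C alone

# Largest absolute difference between the two results for column C.
print((whole["C"] - only_c).abs().max())
```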
Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:
df.groupby('A').apply(zscore)
gives error:
ValueError: operands could not be broadcast together with shapes (6,) (2,)
So where else is .transform useful? The simplest case is assigning the results of a reduction function back to the original dataframe.
df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort_values('A') # (df.sort in older pandas) to clearly see the scalar ('sum') applies to the whole column of the group
yielding:
A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373
Trying the same with .apply would give NaNs in sum_C, because .apply would return a reduced Series indexed by the group keys, which it does not know how to broadcast back:
df.groupby('A')['C'].apply(sum)
giving:
A
bar 3.973
foo 4.373
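The difference comes down to index alignment, which the following sketch tries to make visible on a small made-up dataframe (values and the extra column name sum_C_misaligned are assumptions for illustration). The reduced Series is indexed by the group keys, so assigning it to a frame with a 0..n-1 index finds no matching labels, while the transform result keeps the original row index.

```python
import pandas as pd

# Made-up data, reusing the column names from the answer.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Reduction via apply: one value per group, indexed by the keys of A.
reduced = df.groupby("A")["C"].apply(lambda s: s.sum())

# Same reduction via transform: one value per original row.
broadcast = df.groupby("A")["C"].transform("sum")

df["sum_C"] = broadcast            # row labels match, values line up
df["sum_C_misaligned"] = reduced   # labels 'bar'/'foo' match no row: NaN

print(df)
```

If you already have the reduced Series, mapping it through the key column, as in df["A"].map(reduced), is another way to get the same broadcast values.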
There are also cases when .transform is used to filter the data:
df[df.groupby(['B'])['D'].transform(sum) < -1]
A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179
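Here is a self-contained sketch of that filtering pattern on made-up data (the values are assumptions). It builds the boolean mask with transform and, for comparison, selects the same rows with groupby(...).filter, which keeps or drops whole groups based on a per-group condition.

```python
import pandas as pd

# Made-up data: group 'three' is the only one whose D values sum below -1.
df = pd.DataFrame({
    "B": ["one", "two", "three", "one", "three"],
    "D": [0.5, -0.2, -0.7, 0.3, -0.6],
})

# transform gives a row-aligned Series of group sums, so comparing it
# with a threshold yields a boolean mask over the original rows.
mask = df.groupby("B")["D"].transform("sum") < -1
kept = df[mask]

# The same selection written with groupby.filter.
same = df.groupby("B").filter(lambda g: g["D"].sum() < -1)

print(kept)
```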
I hope this adds a bit more clarity.