Given the following dataframe df and the function complex_function,
import pandas as pd

def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2

df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
col1 col2
0 1 6
1 4 7
2 6 1
3 2 2
4 7 8
there are several solutions for using apply() on only one column. Below, I explain them in detail.
The straightforward solution is the one from @Fabio Lamanna:
df['col1'] = df['col1'].apply(complex_function)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 1 8
Only the first column is modified; the second column is unchanged. The solution is beautiful: it is just one line of code and it reads almost like English: "Take 'col1' and apply the function complex_function to it."
However, if you need data from another column, e.g. 'col2', this does not work. If you want to pass the values of 'col2' to the variable y of complex_function, you need something else.
Alternatively, you could use the whole dataframe as described in this or this SO post:
df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)
or if you prefer (like me) a solution without a lambda function:
def apply_complex_function(x):
    return complex_function(x['col1'])

df['col1'] = df.apply(apply_complex_function, axis=1)
There is a lot going on in this solution that needs to be explained. The apply() function works on pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.
Hence, you need to tell apply() which column to use. To complicate things, apply() only accepts callables. To solve this, you define a (lambda) function with the column x['col1'] as its argument; i.e. you wrap the column information in another function.
Unfortunately, the default value of the axis parameter is zero (axis=0), which means apply() will try to execute the function column-wise and not row-wise. This wasn't a problem in the first solution, because we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I marvel at how often I forget this.)
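To illustrate, here is a small sketch of my own (not code from the original answer): with the default axis=0, apply() hands each whole column to complex_function as a pd.Series, and the expression x > 5 and x > y then raises the ValueError because the truth value of a Series is ambiguous.

try:
    df.apply(complex_function)  # axis=0: each column arrives as a pd.Series -> ValueError
except ValueError as err:
    print(err)  # "The truth value of a Series is ambiguous. ..."

print(df.apply(lambda x: complex_function(x['col1']), axis=1))  # axis=1: each row arrives as a pd.Series, works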
Whether you prefer the version with the lambda function or without is subjective. In my opinion, the line of code is complicated enough to read even without a lambda function thrown in. You only need the (lambda) function as a wrapper; it is just boilerplate code. A reader should not be bothered with it.
Now, you can modify this solution easily to take the second column into account:
def apply_complex_function(x):
    return complex_function(x['col1'], x['col2'])

df['col1'] = df.apply(apply_complex_function, axis=1)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 2 8
At index 4 the value has changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.
Note that you only needed to change the first line of code (i.e. the function) and not the second line.
Never put the column information into your function.
def bad_idea(x):
    return x['col1'] ** 2
By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: Maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)
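A minimal sketch of the column-agnostic alternative (the function square is a made-up example, not from the question): keep the function generic and choose the column at the call site.

def square(x):
    # generic: works on any scalar, no column name baked in
    return x ** 2

squared_col1 = df['col1'].apply(square)
squared_col2 = df['col2'].apply(square)  # the same function is reusable for any column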
Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer by @George Petrov suggests using map(), and the answer by @Thibaut Dubernet proposes assign().
I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calling and overhead from pd.Series.
One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.
Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.
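A minimal sketch of the groupby() case (the grouping key is a made-up example): GroupBy.apply() hands the function one group at a time, whereas DataFrame.apply() hands it a single row or column.

# group rows by whether col2 is larger than 5; the lambda then receives
# each group's 'col1' values as a pd.Series
print(df.groupby(df['col2'] > 5)['col1'].apply(lambda s: s.sum()))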
So it does make sense to consider some alternatives:
map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.

df['col1'] = df['col1'].map(complex_function)
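Because map() also accepts a dict (or a pd.Series) as a lookup table, here is a small sketch (the names lookup and col1_label are made up): values without a matching key become NaN.

lookup = {1: 'low', 2: 'low', 4: 'mid'}
df['col1_label'] = df['col1'].map(lookup)  # 6 and 7 have no key in the dict and become NaN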
applymap() is almost identical for dataframes. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path." But if performance really counts, you should seek an alternative route.

df['col1'] = df.applymap(complex_function).loc[:, 'col1']
assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases, and it does not work with complex_function; you still need apply(), as you can see in the example below. The main use case for assign() is method chaining, because it gives back the dataframe without changing the original dataframe.

df['col1'] = df.assign(col1=df.col1.apply(complex_function))
I only mention it here because it was suggested by other answers, e.g. @durjoy. The list is not exhaustive:
.loc with a boolean mask: my example complex_function could be refactored in this way (see the sketch below).

raw=True parameter: theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray, so you have to refactor it to NumPy. By doing this, you will get a huge performance boost.
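For illustration, one possible vectorized refactoring of complex_function with .loc and a boolean mask (a sketch of my own, not code from the original question):

# build the boolean mask once, on whole columns, instead of calling
# complex_function once per row
mask = (df['col1'] > 5) & (df['col1'] > df['col2'])

df['col1'] = 2             # the "else" branch of complex_function
df.loc[mask, 'col1'] = 1   # the "x > 5 and x > y" branch

The same result can be written as a one-liner with np.where(mask, 1, 2).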