I have some problems with the pandas apply function when using multiple columns with the following DataFrame:
df = DataFrame({'a': np.random.randn(6),
                'b': ['foo', 'bar'] * 3,
                'c': np.random.randn(6)})
and the following function:
def my_test(a, b):
    return a % b
When I try to apply this function with:
df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)
I get the error message:
NameError: ("global name 'a' is not defined", u'occurred at index 0')
I do not understand this message, since I believe I defined the name properly.
I would highly appreciate any help on this issue.
Update
Thanks for your help. I did indeed make some syntax mistakes in the code: the column labels should be put in quotes (row['a'], row['c']). However, I still get the same issue using a more complex function such as:
def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff
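A note on the updated function: df.index is an attribute, not a method, so calling df.index() raises a TypeError (an Index object is not callable). Below is a minimal sketch of a working version, assuming the intent is to sum the differences between one value and every entry of column 'a'. The sample data is made up for illustration and is not the asker's random data.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (hypothetical values, same column layout as the question)
df = pd.DataFrame({'a': [1.0, 2.0, 4.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [0.5, 1.5, 2.5]})

def my_test(a):
    # Sum of (a - x) over every x in column 'a'
    cum_diff = 0
    for ix in df.index:  # attribute access, no parentheses
        cum_diff = cum_diff + (a - df.loc[ix, 'a'])
    return cum_diff

# Only one column is needed, so apply on the Series rather than on rows
df['Value'] = df['a'].apply(my_test)

# The same quantity without a Python loop: n * a - sum(a)
vectorized = len(df) * df['a'] - df['a'].sum()
print(df)
```

The closed form in the last step avoids the nested loop entirely, which matters once the frame has more than a handful of rows.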
Tags: python, python-2.7, pandas, dataframe, apply
Let's say we want to apply a function add5 to columns 'a' and 'c' of DataFrame df (the numeric columns; column 'b' holds strings, so adding 5 to it would raise a TypeError):
def add5(x):
    return x + 5
df[['a', 'c']].apply(add5)
If you just want to compute (column a) % (column c), you don't need apply, just do it directly:
In [7]: df['a'] % df['c']
Out[7]:
0 -1.132022
1 -0.939493
2 0.201931
3 0.511374
4 -0.694647
5 -0.023486
Name: a
This is the same as the previous solution, but with the function defined inline in df.apply itself:
df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)
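For completeness, here is a self-contained sketch of the original my_test called with quoted column labels. The NameError in the question comes from row[a], where Python looks for a variable named a instead of the column label 'a'. The sample values are hypothetical and chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical deterministic data with the same columns as the question
df = pd.DataFrame({'a': [5.0, -7.0, 9.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [3.0, 2.0, -4.0]})

def my_test(a, b):
    return a % b

# Column labels are strings, so they must be quoted inside the lambda
df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

# The direct column-wise expression gives the same numbers
direct = df['a'] % df['c']
print(df)
```

Note that the result takes the sign of the divisor, both in plain Python and in the column-wise version.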
All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of NumPy vectorized operations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})
Example 1: looping with pandas.apply():
%%timeit
def my_test2(row):
    return row['a'] % row['c']
df['Value'] = df.apply(my_test2, axis=1)
The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop
Example 2: vectorize using pandas Series:
%%timeit
df['a'] % df['c']
The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop
Example 3: vectorize using numpy arrays:
%%timeit
df['a'].values % df['c'].values
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop
So vectorizing with NumPy arrays improved the speed by almost two orders of magnitude over apply (481 µs versus 6.39 µs per loop, roughly 75 times faster).
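The %%timeit magic only works inside IPython or Jupyter. As a rough sketch, the same three-way comparison can be run as a plain script with the standard-library timeit module. The function names with_apply, with_series and with_numpy are made up for this example, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so reruns use the same data
df = pd.DataFrame({'a': rng.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.randn(6)})

def with_apply():
    # Row-by-row Python loop driven by DataFrame.apply
    return df.apply(lambda row: row['a'] % row['c'], axis=1)

def with_series():
    # Vectorized on pandas Series (index-aligned)
    return df['a'] % df['c']

def with_numpy():
    # Vectorized on the underlying NumPy arrays
    return df['a'].values % df['c'].values

timings = {}
for fn in (with_apply, with_series, with_numpy):
    # Best of 3 repeats, 100 calls each, reported per call
    best = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    timings[fn.__name__] = best
    print('%-12s %10.1f us per call' % (fn.__name__, best * 1e6))
```

With only six rows, fixed per-call overhead dominates, so the gap between the approaches is expected to change as the frame grows.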
Here is a comparison of all three approaches discussed above, this time including the assignment to a new column:
Using values
%timeit df['value'] = df['a'].values % df['c'].values
139 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Without values
%timeit df['value'] = df['a']%df['c']
216 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Apply function
%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)
474 µs ± 5.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
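One caveat when choosing the .values variant for speed: arithmetic on NumPy arrays is purely positional, while arithmetic on pandas Series aligns by index label first. For a single DataFrame the two agree, because both columns share one index. The sketch below uses two hypothetical Series with differently ordered indexes to show where they diverge.

```python
import pandas as pd

# Same labels, different order (made-up values for illustration)
s1 = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])
s2 = pd.Series([3.0, 7.0, 4.0], index=[2, 1, 0])

# Series arithmetic pairs values by label: 10 % 4, 20 % 7, 30 % 3
aligned = s1 % s2

# Array arithmetic pairs values by position: 10 % 3, 20 % 7, 30 % 4
positional = s1.values % s2.values

print(aligned)
print(positional)
```

So the faster array version is a safe substitute only when both operands are known to be in the same row order.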
Source: Stackoverflow.com