Having issue filtering my result dataframe with an or
condition. I want my result df
to extract all column var
values that are above 0.25 and below -0.25.
This logic below gives me an ambiguous truth value however it work when I split this filtering in two separate operations. What is happening here? not sure where to use the suggested a.empty(), a.bool(), a.item(),a.any() or a.all()
.
result = result[(result['var']>0.25) or (result['var']<-0.25)]
Well pandas use bitwise &
|
and each condition should be wrapped in a ()
For example following works
data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]
But the same query without proper brackets does not
data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]
For boolean logic, use &
and |
.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
>>> df
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
To see what is happening, you get a column of booleans for each comparison, e.g.
df.C > 0.25
0 True
1 False
2 False
3 True
4 True
Name: C, dtype: bool
When you have multiple criteria, you will get multiple columns returned. This is why the join logic is ambiguous. Using and
or or
treats each column separately, so you first need to reduce that column to a single boolean value. For example, to see if any value or all values in each of the columns is True.
# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True
# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False
One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.
>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
For more details, refer to Boolean Indexing in the docs.
You need to use bitwise operators |
instead of or
and &
instead of and
in pandas, you can't simply use the bool statements from python.
For much complex filtering create a mask
and apply the mask on the dataframe.
Put all your query in the mask and apply it.
Suppose,
mask = (df["col1"]>=df["col2"]) & (stock["col1"]<=df["col2"])
df_new = df[mask]
I encountered the same error and got stalled with a pyspark dataframe for few days, I was able to resolve it successfully by filling na values with 0 since I was comparing integer values from 2 fields.
Or, alternatively, you could use Operator module. More detailed information is here Python docs
import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]
A B C
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.4438
One minor thing, which wasted my time.
Put the conditions(if comparing using " = ", " != ") in parenthesis, failing to do so also raises this exception. This will work
df[(some condition) conditional operator (some conditions)]
This will not
df[some condition conditional-operator some condition]
I'll try to give the benchmark of the three most common way (also mentioned above):
from timeit import repeat
setup = """
import numpy as np;
import random;
x = np.linspace(0,100);
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) * (x <= ub)]', 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'
for _ in range(3):
for stmt in stmts:
t = min(repeat(stmt, setup, number=100_000))
print('%.4f' % t, stmt)
print()
result:
0.4808 x[(x > lb) * (x <= ub)]
0.4726 x[(x > lb) & (x <= ub)]
0.4904 x[np.logical_and(x > lb, x <= ub)]
0.4725 x[(x > lb) * (x <= ub)]
0.4806 x[(x > lb) & (x <= ub)]
0.5002 x[np.logical_and(x > lb, x <= ub)]
0.4781 x[(x > lb) * (x <= ub)]
0.4336 x[(x > lb) & (x <= ub)]
0.4974 x[np.logical_and(x > lb, x <= ub)]
But, *
is not supported in Panda Series, and NumPy Array is faster than pandas data frame (arround 1000 times slower, see number):
from timeit import repeat
setup = """
import numpy as np;
import random;
import pandas as pd;
x = pd.DataFrame(np.linspace(0,100));
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'
for _ in range(3):
for stmt in stmts:
t = min(repeat(stmt, setup, number=100))
print('%.4f' % t, stmt)
print()
result:
0.1964 x[(x > lb) & (x <= ub)]
0.1992 x[np.logical_and(x > lb, x <= ub)]
0.2018 x[(x > lb) & (x <= ub)]
0.1838 x[np.logical_and(x > lb, x <= ub)]
0.1871 x[(x > lb) & (x <= ub)]
0.1883 x[np.logical_and(x > lb, x <= ub)]
Note: adding one line of code x = x.to_numpy()
will need about 20 µs.
For those who prefer %timeit
:
import numpy as np
import random
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
lb, ub
x = pd.DataFrame(np.linspace(0,100))
def asterik(x):
x = x.to_numpy()
return x[(x > lb) * (x <= ub)]
def and_symbol(x):
x = x.to_numpy()
return x[(x > lb) & (x <= ub)]
def numpy_logical(x):
x = x.to_numpy()
return x[np.logical_and(x > lb, x <= ub)]
for i in range(3):
%timeit asterik(x)
%timeit and_symbol(x)
%timeit numpy_logical(x)
print('\n')
result:
23 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.6 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
31.3 µs ± 8.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.4 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.9 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.7 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
25.1 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
36.8 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.2 µs ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query
method:
result = result.query("(var > 0.25) or (var < -0.25)")
See also http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.
(Some tests with a dataframe I'm currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)
A piece of warning: At least one situation where this is not straightforward is when column names happen to be python expressions. I had columns named WT_38hph_IP_2
, WT_38hph_input_2
and log2(WT_38hph_IP_2/WT_38hph_input_2)
and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"
I obtained the following exception cascade:
KeyError: 'log2'
UndefinedVariableError: name 'log2' is not defined
ValueError: "log2" is not a supported function
I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.
A possible workaround is proposed here.
Source: Stackoverflow.com