Already some great answers to this question, however here is a nice snippet that I use regularly to drop rows if they have non-numeric values on some columns:
# Eliminate invalid data from dataframe (see Example below for more context)
num_df = (df.drop(data_columns, axis=1)
.join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
num_df = num_df[num_df[data_columns].notnull().all(axis=1)]
The way this works is we first drop
all the data_columns
from the df
, and then use a join
to put them back in after passing them through pd.to_numeric
(with option 'coerce'
, such that all non-numeric entries are converted to NaN
). The result is saved to num_df
.
On the second line we use a filter that keeps only rows where all values are not null.
Note that pd.to_numeric
is coercing to NaN
everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25'
will be recognized as the numeric value 1.25
.
Disclaimer: pd.to_numeric
was introduced in pandas version 0.17.0
Example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
...: "a": [1,2,3,"bad",5],
...: "b":[0.1,0.2,0.3,0.4,0.5]})
In [3]: df
Out[3]:
a b item
0 1 0.1 a
1 2 0.2 b
2 3 0.3 c
3 bad 0.4 d
4 5 0.5 e
In [4]: data_columns = ['a', 'b']
In [5]: num_df = (df
...: .drop(data_columns, axis=1)
...: .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
In [6]: num_df
Out[6]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
3 d NaN 0.4
4 e 5 0.5
In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
4 e 5 0.5