[python] Detect and exclude outliers in Pandas data frame

I have a pandas data frame with few columns.

Now I know that certain rows are outliers based on a certain column value.

For instance

column 'Vol' has all values around 12xx and one value is 4000 (outlier).

Now I would like to exclude those rows that have Vol column like this.

So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from mean.

What is an elegant way to achieve this?

This question is related to python pandas filtering dataframe outliers

The answer is


This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:

df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]

See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe


Since I am in a very early stage of my data science journey, I am treating outliers with the code below.

#Outlier Treatment

def outlier_detect(df):
    for i in df.describe().columns:
        Q1=df.describe().at['25%',i]
        Q3=df.describe().at['75%',i]
        IQR=Q3 - Q1
        LTV=Q1 - 1.5 * IQR
        UTV=Q3 + 1.5 * IQR
        x=np.array(df[i])
        p=[]
        for j in x:
            if j < LTV or j>UTV:
                p.append(df[i].median())
            else:
                p.append(j)
        df[i]=p
    return df

I prefer to clip rather than drop. the following will clip inplace at the 2nd and 98th pecentiles.

df_list = list(df)
minPercentile = 0.02
maxPercentile = 0.98

for _ in range(numCols):
    df[df_list[_]] = df[df_list[_]].clip((df[df_list[_]].quantile(minPercentile)),(df[df_list[_]].quantile(maxPercentile)))

a full example with data and 2 groups follows:

Imports:

from StringIO import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)

Data example with 2 groups: G1:Group 1. G2: Group 2:

TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1

1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6

2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6

2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")

Read text data to pandas dataframe:

df = pd.read_csv(TESTDATA, sep=";")

Define the outliers using standard deviations

stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
           lambda group: (group - group.mean()).abs().div(group.std())) > stds

Define filtered data values and the outliers:

dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]

Print the result:

print '\n'*5, 'All values with decimal 1 are non-outliers. In the other hand, all values with 6 in the decimal are.'
print '\nDef DATA:\n%s\n\nFiltred Values with %s stds:\n%s\n\nOutliers:\n%s' %(df, stds, dfv, dfo)

My function for dropping outliers

def drop_outliers(df, field_name):
    distance = 1.5 * (np.percentile(df[field_name], 75) - np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] > distance + np.percentile(df[field_name], 75)].index, inplace=True)
    df.drop(df[df[field_name] < np.percentile(df[field_name], 25) - distance].index, inplace=True)

For each of your dataframe column, you could get quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If one need to remove lower and upper outliers, combine condition with an AND statement:

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]

Use boolean indexing as you would do in numpy.array

df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data. 

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]

Get the 98th and 2nd percentile as the limits of our outliers

upper_limit = np.percentile(X_train.logerror.values, 98) 
lower_limit = np.percentile(X_train.logerror.values, 2) # Filter the outliers from the dataframe
data[‘target’].loc[X_train[‘target’]>upper_limit] = upper_limit data[‘target’].loc[X_train[‘target’]<lower_limit] = lower_limit

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

scipy.stats has methods trim1() and trimboth() to cut the outliers out in a single row, according to the ranking and an introduced percentage of removed values.


If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

df = pd.DataFrame(np.random.randn(100, 3))

from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

description:

  • For each column, first it computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • Then is takes the absolute of Z-score because the direction does not matter, only if it is below the threshold.
  • all(axis=1) ensures that for each row, all column satisfy the constraint.
  • Finally, result of this condition is used to index the dataframe.

Filter other columns based on a single column

  • Specify a column for the zscore, df[0] for example, and remove .all(axis=1).
df[(np.abs(stats.zscore(df[0])) < 3)]

If you like method chaining, you can get your boolean condition for all numeric columns like this:

df.sub(df.mean()).div(df.std()).abs().lt(3)

Each value of each column will be converted to True/False based on whether its less than three standard deviations away from the mean or not.


For each series in the dataframe, you could use between and quantile to remove outliers.

x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # without outliers

Since I haven't seen an answer that deal with numerical and non-numerical attributes, here is a complement answer.

You might want to drop the outliers only on numerical attributes (categorical variables can hardly be outliers).

Function definition

I have extended @tanemaki's suggestion to handle data when non-numeric attributes are also present:

from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # Constrains will contain `True` or `False` depending on if it is a value below the threshold.
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh, reduce=False) \
        .all(axis=1)
    # Drop (inplace) values set to be rejected
    df.drop(df.index[~constrains], inplace=True)

Usage

drop_numerical_outliers(df)

Example

Imagine a dataset df with some values about houses: alley, land contour, sale price, ... E.g: Data Documentation

First, you want to visualise the data on a scatter graph (with z-score Thresh=3):

# Plot data before dropping those greater than z-score 3. 
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

Before - Gr Liv Area Versus SalePrice

# Drop the outliers on every attributes
drop_numerical_outliers(train_df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the new computed outliers based on the new data-frame.
scatterAreaVsPrice(train_df)

After - Gr Liv Area Versus SalePrice


You can use boolean mask:

import pandas as pd

def remove_outliers(df, q=0.05):
    upper = df.quantile(1-q)
    lower = df.quantile(q)
    mask = (df < upper) & (df > lower)
    return mask

t = pd.DataFrame({'train': [1,1,2,3,4,5,6,7,8,9,9],
                  'y': [1,0,0,1,1,0,0,1,1,1,0]})

mask = remove_outliers(t['train'], 0.1)

print(t[mask])

output:

   train  y
2      2  0
3      3  1
4      4  1
5      5  0
6      6  0
7      7  1
8      8  1

Deleting and dropping outliers I believe is wrong statistically. It makes the data different from original data. Also makes data unequally shaped and hence best way is to reduce or avoid the effect of outliers by log transform the data. This worked for me:

np.log(data.iloc[:, :])

Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

Original data

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()

Winsorized data


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to filtering

Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() Filtering array of objects with lodash based on property value How can I return the difference between two lists? I have filtered my Excel data and now I want to number the rows. How do I do that? Creating lowpass filter in SciPy - understanding methods and units filter items in a python dictionary where keys contain a specific string Detect and exclude outliers in Pandas data frame Filtering Pandas DataFrames on dates Logical operators for boolean indexing in Pandas How to run a SQL query on an Excel table?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to outliers

Detect and exclude outliers in Pandas data frame How to remove outliers in boxplot in R? How to remove outliers from a dataset