Is there a numpy builtin to reject outliers from a list

111

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.

import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return filtered

>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print filtered_d
[2,4,5,1,6,5]

I say 'something like' because the function might allow for varying distributions (poisson, gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).

This question is tagged with python numpy

~ Asked on 2012-07-27 11:19:17

The Best Answer is


117

This method is almost identical to yours, just more numpyst (also working on numpy arrays only):

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]

~ Answered on 2012-07-27 11:22:30


194

Something important when dealing with outliers is that one should try to use estimators as robust as possible. The mean of a distribution will be biased by outliers but e.g. the median will be much less.

Building on eumiro's answer:

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m]

Here I have replace the mean with the more robust median and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.

~ Answered on 2013-05-15 09:58:26


Most Viewed Questions: