[python] Is there a numpy builtin to reject outliers from a list

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.

import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered

>>> d = [2, 4, 5, 1, 6, 5, 40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2, 4, 5, 1, 6, 5]

I say 'something like' because the function might allow for varying distributions (Poisson, Gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).

Tags: python, numpy

Answers:


This answer provides two methods: a solution based on the z-score and a solution based on the IQR.

The code in this answer works on both one-dimensional and multi-dimensional numpy arrays.

First, let's import some modules.

import collections.abc
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr

Z-score based method

This method tests whether a value falls more than three standard deviations from the mean (the bar parameter, default 3). It returns True for outliers and False otherwise.

def sd_outlier(x, axis=None, bar=3, side='both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    # Standardize: how many standard deviations each value is from the mean.
    d_z = stat.zscore(x, axis=axis)

    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar
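
For example, on the question's data (note that the single large value inflates the standard deviation itself, so it only sits about 2.4 standard deviations out; lower bar to catch it):

x = np.array([2, 4, 5, 1, 6, 5, 40])
print(sd_outlier(x, bar=2))
# [False False False False False False  True]
print(x[~sd_outlier(x, bar=2)])  # keep only the inliers
# [2 4 5 1 6 5]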

IQR based method

This method tests whether a value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is the same rule SPSS box plots use.

def q1(x, axis=None):
    return np.percentile(x, 25, axis=axis)

def q3(x, axis=None):
    return np.percentile(x, 75, axis=axis)

def iqr_outlier(x, axis=None, bar=1.5, side='both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_iqr = iqr(x, axis=axis)
    d_q1 = q1(x, axis=axis)
    d_q3 = q3(x, axis=axis)
    iqr_distance = np.multiply(d_iqr, bar)

    # Shape the per-axis statistics so they broadcast back against x.
    stat_shape = list(x.shape)
    if axis is None:
        stat_shape = [1] * x.ndim
    elif isinstance(axis, collections.abc.Iterable):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1

    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)

    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    return np.logical_or(upper_outlier, lower_outlier)

Finally, if you want to filter out the outliers, index with the negated boolean mask.
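
For example, with the question's data:

x = np.array([2, 4, 5, 1, 6, 5, 40])
mask = iqr_outlier(x)  # True marks an outlier
print(x[~mask])
# [2 4 5 1 6 5]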

Have a nice day.


An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z).

As a simple example:

import numpy as np

# Create some random numbers
x = np.random.normal(5, 2, 1000)

# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500

# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Measure the percentile intervals and then estimate the standard deviation of the
# distribution, both from the median to the 90th percentile and from the 10th to the 90th
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)

rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

The output I get is:

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

Which is close to the expected value of 2.

If we want to remove points more than 5 robust standard deviations from the median (with 1000 points we would expect about one value beyond 3 standard deviations):

y = x[abs(x - p50) < rSig*5]

# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

Which gives:

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883

I have no idea which approach is the more efficient/robust.


For a set of images (each image has 3 dimensions), where I wanted to reject outliers for each pixel, I used:

mean = np.mean(imgs, axis=0)                           # per-pixel mean across the stack
std = np.std(imgs, axis=0)                             # per-pixel standard deviation
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))  # True where a pixel is close to its mean
masked = np.multiply(imgs, mask)                       # zero out the outlying pixels

Then it is possible to compute the mean:

masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

(I use it for Background Subtraction)
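
A minimal self-contained sketch of this approach (the frame stack here is synthetic, purely for illustration):

import numpy as np

# Hypothetical stack: 10 frames of 4x4 grayscale background images
imgs = np.random.normal(100.0, 5.0, size=(10, 4, 4))
imgs[3, 2, 2] = 255.0  # a transient bright pixel in one frame

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)
masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

print(masked_mean[2, 2])  # close to 100: the 255 spike was rejected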


Here I find the outliers in x and substitute them with the median of a window of win points around them (borrowing the median deviation from Benjamin Bannier's answer):

import numpy as np
import matplotlib.pyplot as plt

def outlier_smoother(x, m=3, win=3, plots=False):
    ''' finds outliers in x, points > m*mdev(x) [mdev: median deviation]
    and replaces them with the median of win points around them '''
    x_corr = np.copy(x)
    d = np.abs(x - np.median(x))
    mdev = np.median(d)
    idxs_outliers = np.nonzero(d > m * mdev)[0]
    for i in idxs_outliers:
        if i - win < 0:
            x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
        elif i + win + 1 > len(x):
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
        else:
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
    if plots:
        plt.figure('outlier_smoother', clear=True)
        plt.plot(x, label='orig.', lw=5)
        plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')
        plt.plot(x_corr, '-o', label='corrected')
        plt.legend()

    return x_corr
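
For example, on a short trace with one spike:

x = np.array([1., 2., 1., 3., 45., 2., 1., 3., 2.])
print(outlier_smoother(x))
# [1. 2. 1. 3. 2. 2. 1. 3. 2.]  (the 45 is replaced by the window median, 2.0)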

[Plot: original signal, detected outliers, and corrected output]


I wanted to do something similar, except setting the number to NaN rather than removing it from the data, since removing it changes the length, which can mess up plotting (i.e. if you're only removing outliers from one column in a table, it needs to stay the same length as the other columns so you can plot them against each other).

To do so I used numpy's masking functions:

def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
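
For example, the masked values are hidden from subsequent numpy operations instead of being deleted:

data = np.array([2, 4, 5, 1, 6, 5, 40])
masked = reject_outliers(data)
# Masking values outside of -16.52... and 34.52...
print(masked)         # [2 4 5 1 6 5 --]
print(masked.mean())  # ~3.83, the masked 40 is ignored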

Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, but the median, for example, will be much less so.

Building on eumiro's answer:

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))  # distance from the median
    mdev = np.median(d)                 # median of those distances (the MAD)
    s = d / mdev if mdev else 0.        # distances on a relative scale
    return data[s < m]

Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance from the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.
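
On the question's data, for example:

d = np.array([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(d))        # [4 5 6 5]  (m=2 is fairly strict on so few points)
print(reject_outliers(d, m=5.))  # [2 4 5 1 6 5]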


If you want to get the index positions of the outliers as well, idx_list will return them:

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0.
    data_range = np.arange(len(data))
    idx_list = data_range[s >= m]
    return data[s < m], idx_list

data_points = np.array([8, 10, 35, 17, 73, 77])  
print(reject_outliers(data_points))

after rejection: [ 8 10 35 17], index positions of outliers: [4 5]

Building on Benjamin's, using pandas.Series, and replacing MAD with IQR:

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1 - pcnt])
    iqr = qhigh - qlow
    return sr[(sr - median).abs() <= iqr]

For instance, if you set iq_range=0.6, the quantiles bounding the interquartile range become 0.20 <--> 0.80, so the acceptance band widens and fewer points are rejected.
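
A quick check on the question's data (this variant expects a pandas.Series):

import pandas as pd

sr = pd.Series([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(sr).tolist())  # [4, 5, 6, 5]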


Consider that all of the above methods that rely on the standard deviation fail when it gets very large due to huge outliers.

(Similar to how the mean calculation fails and one should rather calculate the median. The standard deviation is, however, even more prone to such errors than the mean.)

You could try to apply your algorithm iteratively, or you can filter using the interquartile range (here factor corresponds to an n*sigma range, but only if your data follows a Gaussian distribution):

import numpy as np

def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    iqrSigma = iqr / 1.34896  # the IQR of a Gaussian is ~1.349 sigma
    medData = np.median(dataIn)
    dataOut = [x for x in dataIn
               if (medData - factor * iqrSigma < x < medData + factor * iqrSigma)]
    return dataOut
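
On the question's data:

d = [2, 4, 5, 1, 6, 5, 40]
print(sortoutOutliers(d, factor=3))  # [2, 4, 5, 1, 6, 5]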

Benjamin Bannier's answer yields a pass-through when the median of the distances from the median is 0, so I found this modified version a bit more helpful for cases like the one in the example below.

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)  # fall back to the raw distances when mdev is 0
    return data[s < m]

Example:

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

Gives:

[[10, 10, 10, 17, 10, 10]]  # 17 is not filtered
[10, 10, 10, 10, 10]  # 17 is filtered (its distance, 7, is greater than m)