How to remove outliers from a dataset

Question

I've got some multivariate data of beauty vs. ages. The ages range from 20-40 at intervals of 2 (20, 22, 24, ..., 40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box. I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.

User · Answer

Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks. Peaks are not always the same as outliers, but they are frequently similar. An example is shown here: this dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor errors, or both lead to repeated, clearly erroneous data points. There is no statistical value in these points. They are arguably not outliers; they are errors. The z-score peak detector was able to signal on the spurious data points and generated a clean resulting dataset.
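As a rough illustration of z-score flagging (a plain global z-score, not the peak detector this answer refers to), the sketch below marks points that lie more than a chosen number of standard deviations from the mean. The helper name `flag_by_zscore`, the threshold of 3, and the sample readings are all assumptions made for this example.

```r
# Hypothetical helper: TRUE where the absolute z-score exceeds `threshold`.
flag_by_zscore <- function(x, threshold = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  !is.na(z) & abs(z) > threshold
}

set.seed(42)
# Made-up sensor readings with two spurious values appended at the end.
readings <- c(rnorm(200, mean = 20, sd = 1), 95, 95)

is_spike <- flag_by_zscore(readings)
clean <- readings[!is_spike]   # the original vector is left untouched
```

Extreme values inflate the mean and standard deviation they are measured against, so a global z-score can miss outliers in small samples.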

User · Answer

Use outline = FALSE as an option when you do the boxplot (read the help!).

    m <- c(rnorm(10), 5, 10)
    bp <- boxplot(m, outline = FALSE)

User · Answer

Nobody has posted the simplest answer:

    x[!x %in% boxplot.stats(x)$out]

Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
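For the per-age boxplots in the question, the same idea can be applied within each group. This is a minimal sketch, assuming a data frame `df` with columns `age` and `beauty`; the names and values are made up for illustration.

```r
# Made-up example data: a single rating of 5 at age 20 is the only extreme value.
df <- data.frame(
  age    = rep(c(20, 22), each = 8),
  beauty = c(2, 3, 3, 2, 3, 2, 3, 5,   # age 20
             3, 4, 4, 3, 4, 3, 4, 4)   # age 22
)

# Within each age group, mark ratings that boxplot.stats() reports as outliers.
# ave() stores the logical result as 0/1 in a numeric vector, hence the `== 1`.
is_out <- ave(df$beauty, df$age,
              FUN = function(v) v %in% boxplot.stats(v)$out) == 1

df_clean <- df[!is_out, ]   # new data frame; df itself is unchanged
```

Because the test is done per group, a rating that is ordinary at one age can still be dropped at another.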

User · Answer

    x <- quantile(data$attribute, c(0.01, 0.99))
    data_clean <- data[data$attribute >= x[1] & data$attribute <= x[2], ]

I find this a very easy way to remove outliers. In the above example I am just keeping the values of attribute between the 1st and 99th percentiles.
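A self-contained run of this percentile trimming might look like the sketch below. The data frame `data`, its column `attribute`, and the planted values of -50 and 50 are made up for the example.

```r
set.seed(1)
# Made-up data: 998 standard normal draws plus two planted extremes.
data <- data.frame(attribute = c(rnorm(998), -50, 50))

# 1st and 99th percentiles of the column being filtered.
bounds <- quantile(data$attribute, c(0.01, 0.99))

# Keep rows inside the bounds (inclusive); the result is a new data frame.
keep <- data$attribute >= bounds[1] & data$attribute <= bounds[2]
data_clean <- data[keep, , drop = FALSE]
```

Unlike the 1.5 * IQR rule, this always discards roughly 2% of the rows, whether or not they are unusual.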

User · Answer

I looked for packages related to removing outliers and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf

If you go through it, you see different ways of removing outliers, and among them I found rm.outlier the most convenient one to use. As it says in the link above: "If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by sample mean or median." Here is the usage part from the same source:

Usage

    rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)

Arguments

x: a dataset, most frequently a vector. If argument is a dataframe, then outlier is removed from each column by sapply. The same behavior is applied by apply when the matrix is given.
fill: If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the outlier(s) is/are simply removed.
median: If set to TRUE, median is used instead of mean in outlier replacement.
opposite: if set to TRUE, gives opposite value (if largest value has maximum difference from the mean, it gives smallest and vice versa).

User · Answer

Wouldn't:

    z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
            df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ]  # rows

accomplish this task quite easily?
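For reference, boxplot() takes its outliers from boxplot.stats(), which builds the fences from the hinges returned by fivenum() rather than from quantile(), so the two approaches can disagree slightly on small samples. The sketch below reproduces the boxplot rule by hand on made-up data.

```r
set.seed(1)
x <- c(-10, rnorm(100), 10)    # made-up sample with two planted extremes

h     <- fivenum(x)            # min, lower hinge, median, upper hinge, max
width <- h[4] - h[2]           # spread between the hinges
lower <- h[2] - 1.5 * width    # 1.5 is the default `coef` of boxplot.stats()
upper <- h[4] + 1.5 * width

by_hand <- x[x < lower | x > upper]   # same points boxplot() draws as outliers
```

Comparing `by_hand` with `boxplot.stats(x)$out` is a quick way to check that a hand-written filter matches what the plot shows.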

User · Answer

OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:

    remove_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
      H <- 1.5 * IQR(x, na.rm = na.rm)
      y <- x
      y[x < (qnt[1] - H)] <- NA
      y[x > (qnt[2] + H)] <- NA
      y
    }

To see it in action:

    set.seed(1)
    x <- rnorm(100)
    x <- c(-10, x, 10)
    y <- remove_outliers(x)
    ## png()
    par(mfrow = c(1, 2))
    boxplot(x)
    boxplot(y)
    ## dev.off()

And once again, you should never do this on your own, outliers are just meant to be!

EDIT: I added na.rm = TRUE as default.

EDIT2: Removed quantile function, added subscripting, hence made the function faster!
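To follow the advice about not overwriting the original values, the NA-marked result can be stored in a separate column, computed within each age group. This is a minimal sketch, assuming a data frame `df` with columns `age` and `beauty` (hypothetical names and made-up values); the function from this answer is repeated so the block runs on its own.

```r
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up example data: a single rating of 5 at age 20 is the only extreme value.
df <- data.frame(
  age    = rep(c(20, 22), each = 8),
  beauty = c(2, 3, 3, 2, 3, 2, 3, 5,   # age 20
             3, 4, 4, 3, 4, 3, 4, 4)   # age 22
)

# New column: per-age ratings with outliers set to NA; `beauty` stays untouched.
df$beauty_clean <- ave(df$beauty, df$age, FUN = remove_outliers)

# Rows can be dropped later, and only where an analysis needs it.
df_no_outliers <- df[!is.na(df$beauty_clean), ]
```

Keeping both columns makes it easy to compare results with and without the flagged ratings.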

User · Answer

Try this. Feed your variable into the function and save the output in a variable, which will then contain the values with outliers removed.

    outliers <- function(variable) {
      iqr <- IQR(variable)
      q1 <- as.numeric(quantile(variable, 0.25))
      q3 <- as.numeric(quantile(variable, 0.75))
      mild_low <- q1 - (1.5 * iqr)
      mild_high <- q3 + (1.5 * iqr)
      new_variable <- variable[variable > mild_low & variable < mild_high]
      return(new_variable)
    }

User · Answer

One way to do that is:

    out <- boxplot.stats(my.data.frame$my.column)$out
    my.NEW.data.frame <- my.data.frame[!my.data.frame$my.column %in% out, ]

or

    my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
    my.NEW.data.frame <- my.data.frame[-my.high.value, ]
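One caveat with negative indexing: if which() finds no matching rows, -integer(0) selects nothing and the result has zero rows. A logical mask avoids this. The sketch below assumes a hypothetical data frame `my.data.frame` with an `age` column and made-up values.

```r
my.data.frame <- data.frame(age = c(25, 31, 250, -4, 40))  # made-up values

# TRUE for rows with impossible ages; all FALSE when nothing matches.
bad <- my.data.frame$age > 200 | my.data.frame$age < 0
my.NEW.data.frame <- my.data.frame[!bad, , drop = FALSE]

# The same mask on data with no bad rows keeps every row ...
ok <- data.frame(age = c(25, 31, 40))
kept <- ok[!(ok$age > 200 | ok$age < 0), , drop = FALSE]

# ... whereas negative indexing with an empty which() drops them all.
lost <- ok[-which(ok$age > 200 | ok$age < 0), , drop = FALSE]
```

Here `kept` still has all three rows, while `lost` is empty.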

User · Answer

Adding to @sefarkas' suggestion and using quantiles as cut-offs, one could explore the following option:

    newdata <- subset(mydata,
                      !(mydata$var > quantile(mydata$var, probs = c(.01, .99))[2] |
                        mydata$var < quantile(mydata$var, probs = c(.01, .99))[1]))

This will remove the points below the 1st percentile and above the 99th percentile. Care should be taken, as aL3Xa was saying, about keeping outliers. They should be removed only to get an alternative, conservative view of the data.

User · Answer

The boxplot function returns the values used to do the plotting (which is actually then done by bxp()):

    bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
    # need to "waste" this plot
    bstats$out <- NULL
    bstats$group <- NULL
    bxp(bstats)  # this will plot without any outlier points

I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
