Perform a Shapiro-Wilk Normality Test

Question

I want to perform a Shapiro-Wilk normality test. My data is in CSV format. It looks like this:

    > heisenberg
        HWWIchg
    1    -15.60
    2    -21.60
    3    -19.50
    4    -19.10
    5    -20.90
    6    -20.70
    7    -19.30
    8    -18.30
    9    -15.10

However, when I perform the test, I get:

    > shapiro.test(heisenberg)
    Error in `[.data.frame`(x, complete.cases(x)) :
      undefined columns selected

Why isn't R selecting the right column, and how do I do that?

User · Answer

You are applying shapiro.test() to a data.frame instead of the column. Try the following:

    shapiro.test(heisenberg$HWWIchg)
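To make the difference concrete, here is a minimal sketch that rebuilds the data frame from the question (assuming the values are decimals such as -15.60) and calls the test both ways. The exact wording of the error may differ between R versions, so the sketch only checks that an error occurs.

```r
# Data from the question, assuming the values read as decimals (e.g. -15.60).
heisenberg <- data.frame(
  HWWIchg = c(-15.60, -21.60, -19.50, -19.10, -20.90,
              -20.70, -19.30, -18.30, -15.10)
)

# Passing the whole data frame fails: shapiro.test() wants a numeric vector.
res_df <- try(shapiro.test(heisenberg), silent = TRUE)
inherits(res_df, "try-error")

# Passing the column, which is a numeric vector, works.
res <- shapiro.test(heisenberg$HWWIchg)
res
```

The `$` operator extracts the column as a plain numeric vector, which is the input type the function expects.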

User · Answer

Convert the data to a vector, then pass that vector to the function.
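A possible way to do this when the data starts out as CSV is sketched below. The inline `csv_text` string is a stand-in for the real file (an assumption, since the question does not give a file name), and the values are assumed to be decimals.

```r
# Stand-in for the CSV file from the question (file name unknown, so the
# contents are supplied inline; values assumed to be decimals).
csv_text <- "HWWIchg
-15.60
-21.60
-19.50
-19.10
-20.90
-20.70
-19.30
-18.30
-15.10"

heisenberg <- read.csv(text = csv_text)

# Double brackets return the column as a numeric vector ...
x <- heisenberg[["HWWIchg"]]
is.numeric(x)

# ... whereas single brackets still return a data frame.
is.data.frame(heisenberg["HWWIchg"])

shapiro.test(x)
```

With a real file, `read.csv("path/to/file.csv")` would replace the `text =` call; the extraction step stays the same.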

User · Answer

You failed to specify the exact columns (data) to test for normality. Use this instead:

    shapiro.test(heisenberg$HWWIchg)
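If a data frame has several columns, the same idea extends to testing each numeric column separately. The sketch below is illustrative only: the `other` and `label` columns are made up for the example and are not part of the question's data.

```r
# Hypothetical data frame: HWWIchg is taken from the question (values assumed
# to be decimals); 'other' and 'label' are invented for illustration.
df <- data.frame(
  HWWIchg = c(-15.60, -21.60, -19.50, -19.10, -20.90,
              -20.70, -19.30, -18.30, -15.10),
  other   = c(2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 3.7, 2.2, 2.9),
  label   = letters[1:9]
)

# Keep only the numeric columns, then run the test on each one.
numeric_cols <- df[sapply(df, is.numeric)]
results <- lapply(numeric_cols, shapiro.test)

# Collect the p-values into a named numeric vector.
p_values <- vapply(results, function(r) r$p.value, numeric(1))
p_values
```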

User · Answer

What does shapiro.test do?

shapiro.test tests the null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".

How to perform shapiro.test in R?

The R help page for ?shapiro.test gives:

    x - a numeric vector of data values. Missing values are allowed,
        but the number of non-missing values must be between 3 and 5000.

That is, shapiro.test expects a numeric vector as input, corresponding to the sample you would like to test, and it is the only input required. Since you have a data.frame, you'll have to pass the desired column as input to the function as follows:

    > shapiro.test(heisenberg$HWWIchg)

        Shapiro-Wilk normality test

    data:  heisenberg$HWWIchg
    W = 0.9001, p-value = 0.2528

Interpreting results from shapiro.test

First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.

As shown above, shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing against the assumption of Normality. In other words (correct me if I am wrong), it would be much better if one tested the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis, and failing to reject it is not the same as accepting it.

In the case of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from a normal distribution. To put it loosely, data this extreme would be rare if the samples really came from a normal distribution. The side effect of this kind of hypothesis testing is that such a rejection happens only rarely, so non-normal samples often pass. To illustrate, take for example:

    set.seed(450)
    x <- runif(50, min=2, max=4)
    shapiro.test(x)

        Shapiro-Wilk normality test

    data:  runif(50, min = 2, max = 4)
    W = 0.9601, p-value = 0.08995

So, according to this test, this (particular) sample runif(50, min=2, max=4) is not rejected as non-normal at the 0.05 level, even though it was drawn from a uniform distribution. What I am trying to say is that there are many cases in which the "extreme" requirement (p < 0.05) is not satisfied, which leads to "acceptance" of the NULL hypothesis most of the time, and that can be misleading.

Another issue I'd like to quote here, from @PaulHiemstra in the comments, concerns the effect of large sample sizes:

"An additional issue with the Shapiro-Wilk's test is that when you feed it more data, the chances of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough."

Although he also points out that R's data size limit protects against this a bit:

"Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000."

If the NULL hypothesis were the opposite, meaning "the samples do not come from a normal distribution", and you got a p-value < 0.05, then you would conclude that it is very rare that these samples do not come from a normal distribution (reject the NULL hypothesis). That loosely translates to: it is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong.

@PaulHiemstra also comments about practical situations (for example, regression) in which one comes across this problem of testing for normality:

"In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk's test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analyze your data correctly."

Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:

"For linear regression,

1. Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.

2. Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.

3. Outliers. A Cook's distance of > 1 is reasonable cause for concern.

Those are my thoughts (FWIW)."

Hope this clears things up a bit.
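The regression advice quoted above can be sketched as follows. This is only an illustration under stated assumptions: the built-in `cars` data set and the model `dist ~ speed` are stand-ins chosen for the example and do not come from the original thread.

```r
# Built-in 'cars' data set, used here only as a stand-in example.
fit <- lm(dist ~ speed, data = cars)

# Standard diagnostic plots in a 2 x 2 grid:
# 1 = residuals vs fitted, 2 = normal Q-Q,
# 3 = scale-location,      4 = Cook's distance
op <- par(mfrow = c(2, 2))
plot(fit, which = 1:4)
par(op)  # restore the previous plotting layout

# Observations whose Cook's distance exceeds the threshold of 1
# mentioned in the quoted comment.
cd <- cooks.distance(fit)
which(cd > 1)
```

The scale-location panel is the one to look at for unequal variances, and the Cook's distance values flag potentially influential observations; both are judged by eye rather than by a single p-value.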
