Calculate correlation with cor(), only for numerical columns

Question

I have a dataframe and would like to calculate the correlation (with Spearman; the data is categorical and ranked), but only for a subset of columns. I tried with all, but R's cor() function only accepts numerical data ("x must be numeric", says the error message), even if Spearman is used.

One brute approach is to delete the non-numerical columns from the dataframe. This is not as elegant, and for speed I still don't want to calculate correlations between all columns.

I hope there is a way to simply say "calculate correlations for columns x, y, z". Column references could be by number or by name. I suppose the flexible way to provide them would be through a vector.

Any suggestions are appreciated.

User · Answer

For numerical data you have the solution. But it is categorical data, you said. Then life gets a bit more complicated...

Well, first: the amount of association between two categorical variables is not measured with a Spearman rank correlation but with, for example, a chi-square test. Which is logical, actually: ranking means there is some order in your data. Now tell me which is larger, yellow or red? I know, sometimes R does perform a Spearman rank correlation on categorical data: if I code yellow as 1 and red as 2, R would consider red larger than yellow.
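To make the coding problem concrete, here is a minimal sketch. The colour and score values are made up for illustration, and the two codings are both arbitrary. The same nominal variable, coded in two ways, gives Spearman correlations of opposite sign:

```r
# Hypothetical data: a nominal colour and some numeric score
colour <- c("yellow", "red", "red", "yellow", "red", "yellow", "red", "red")
score  <- c(1, 5, 4, 2, 6, 3, 8, 7)

# Two equally arbitrary numeric codings of the same categories
code_a <- ifelse(colour == "yellow", 1, 2)  # yellow = 1, red = 2
code_b <- ifelse(colour == "yellow", 2, 1)  # yellow = 2, red = 1

rho_a <- cor(code_a, score, method = "spearman")
rho_b <- cor(code_b, score, method = "spearman")

# Same data, same categories, opposite sign
c(rho_a = rho_a, rho_b = rho_b)
```

Nothing in the data changed between the two calls, only the labels, so the sign of the result carries no information about the colours themselves.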

So, forget about Spearman for categorical data. I'll demonstrate chisq.test() and how to choose columns using combn(). But you would benefit from a bit more time with Agresti's book: http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937

set.seed(1234)
X <- rep(c("A", "B"), 20)
Y <- sample(c("C", "D"), 40, replace = TRUE)

# Cross-tabulate the two variables and test for association
table(X, Y)
chisq.test(table(X, Y), correct = FALSE)
# I don't use Yates continuity correction

# Let's make a data frame with tons of columns (80 rows, 25 columns)

Data <- as.data.frame(
          matrix(
            sample(letters[1:3], 2000, replace = TRUE),
            ncol = 25
          )
        )

# You want to select which columns to use
columns <- c(3, 7, 11, 24)
vars <- names(Data)[columns]

# Say you need to know which ones are associated with each other:
# combn(columns, 2) gives every pair of columns, one pair per matrix column
out <- apply(combn(columns, 2), 2, function(x) {
          chisq.test(table(Data[, x[1]], Data[, x[2]]), correct = FALSE)$p.value
        })

# Attach the variable names of each pair to its p-value
out <- cbind(as.data.frame(t(combn(vars, 2))), out)

Then you should get:

> out
   V1  V2       out
1  V3  V7 0.8116733
2  V3 V11 0.1096903
3  V3 V24 0.1653670
4  V7 V11 0.3629871
5  V7 V24 0.4947797
6 V11 V24 0.7259321

Here V1 and V2 indicate which pair of variables was tested, and "out" gives the p-value for their association. None of the p-values is small, so there is no evidence of association between any pair. Which you would expect, as I created the data at random.
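Since the question asks for columns to be given either by number or by name, the same idea can be wrapped in a small function. This is only a sketch: `pairwise_chisq`, `toy` and the output column names are hypothetical and not part of the answer above.

```r
# Hypothetical helper: pairwise chi-square p-values for selected columns.
# `cols` may be a vector of column numbers or of column names.
pairwise_chisq <- function(data, cols) {
  sub <- data[, cols, drop = FALSE]   # indexing works for numbers and names alike
  pairs <- combn(names(sub), 2)       # one column of `pairs` per variable pair
  p <- apply(pairs, 2, function(v) {
    chisq.test(table(sub[[v[1]]], sub[[v[2]]]), correct = FALSE)$p.value
  })
  data.frame(var1 = pairs[1, ], var2 = pairs[2, ], p.value = p)
}

# Made-up random data: 100 rows, 6 categorical columns V1..V6
set.seed(42)
toy <- as.data.frame(matrix(sample(letters[1:3], 600, replace = TRUE), ncol = 6))

by_number <- pairwise_chisq(toy, c(1, 3, 5))
by_name   <- pairwise_chisq(toy, c("V1", "V3", "V5"))
by_number
```

Both calls select the same three columns, so they return the same table of three pairs.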

User · Answer

I found an easier way by looking at the R script generated by Rattle. It looks like below:

correlations <- cor(mydata[,c(1,3,5:87,89:90,94:98)], use="pairwise", method="spearman")
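The same subsetting also works with a vector of column names. A minimal sketch, assuming a made-up data frame (`mydata` here and its columns `id`, `x`, `y`, `z`, `w` are invented for illustration):

```r
# Hypothetical data frame mixing ranked (numeric) columns with a text column
set.seed(1)
mydata <- data.frame(
  id = letters[1:20],
  x  = sample(1:5, 20, replace = TRUE),
  y  = sample(1:5, 20, replace = TRUE),
  z  = sample(1:5, 20, replace = TRUE),
  w  = sample(1:5, 20, replace = TRUE)
)

wanted <- c("x", "y", "z")  # column names; c(2, 3, 4) selects the same columns

by_name  <- cor(mydata[, wanted],     use = "pairwise", method = "spearman")
by_index <- cor(mydata[, c(2, 3, 4)], use = "pairwise", method = "spearman")
by_name
```

Only the three requested columns enter the computation, so the text column `id` never reaches cor() and no correlations are calculated for `w`.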

User · Answer

If you have a dataframe where some columns are numeric and some are other (character or factor), and you only want to do the correlations for the numeric columns, you could do the following:

set.seed(10)

x = as.data.frame(matrix(rnorm(100), ncol = 10))
x$L1 = letters[1:10]
x$L2 = letters[11:20]

cor(x)

Error in cor(x) : 'x' must be numeric

but

cor(x[sapply(x, is.numeric)])

             V1         V2          V3          V4          V5          V6          V7
V1   1.00000000  0.3025766 -0.22473884 -0.72468776  0.18890578  0.14466161  0.05325308
V2   0.30257657  1.0000000 -0.27871430 -0.29075170  0.16095258  0.10538468 -0.15008158
V3  -0.22473884 -0.2787143  1.00000000 -0.22644156  0.07276013 -0.35725182 -0.05859479
V4  -0.72468776 -0.2907517 -0.22644156  1.00000000 -0.19305921  0.16948333 -0.01025698
V5   0.18890578  0.1609526  0.07276013 -0.19305921  1.00000000  0.07339531 -0.31837954
V6   0.14466161  0.1053847 -0.35725182  0.16948333  0.07339531  1.00000000  0.02514081
V7   0.05325308 -0.1500816 -0.05859479 -0.01025698 -0.31837954  0.02514081  1.00000000
V8   0.44705527  0.1698571  0.39970105 -0.42461411  0.63951574  0.23065830 -0.28967977
V9   0.21006372 -0.4418132 -0.18623823 -0.25272860  0.15921890  0.36182579 -0.18437981
V10  0.02326108  0.4618036 -0.25205899 -0.05117037  0.02408278  0.47630138 -0.38592733
              V8           V9         V10
V1   0.447055266  0.210063724  0.02326108
V2   0.169857120 -0.441813231  0.46180357
V3   0.399701054 -0.186238233 -0.25205899
V4  -0.424614107 -0.252728595 -0.05117037
V5   0.639515737  0.159218895  0.02408278
V6   0.230658298  0.361825786  0.47630138
V7  -0.289679766 -0.184379813 -0.38592733
V8   1.000000000  0.001023392  0.11436143
V9   0.001023392  1.000000000  0.15301699
V10  0.114361431  0.153016985  1.00000000
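The numeric filter can also be combined with an explicit list of wanted columns, so that only the requested columns are considered and any non-numeric ones among them are dropped. A minimal sketch, where the `wanted` vector is a hypothetical request and `vapply` is used as a type-checked variant of `sapply`:

```r
set.seed(10)
x <- as.data.frame(matrix(rnorm(100), ncol = 10))
x$L1 <- letters[1:10]

wanted <- c("V1", "V4", "L1", "V8")  # hypothetical request; L1 is not numeric

# Check only the requested columns and keep the numeric ones
is_num <- vapply(x[wanted], is.numeric, logical(1))
usable <- wanted[is_num]

res <- cor(x[usable], method = "spearman")
res
```

Here `L1` is silently dropped. If a silent drop is not wanted, comparing `usable` with `wanted` shows which columns were excluded.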

User · Answer

Another option would be to just use the excellent corrr package (https://github.com/drsimonj/corrr) and do

require(corrr)
require(dplyr)

myData %>%
   select(x,y,z) %>%  # or do negative or range selections here
   correlate() %>%
   rearrange() %>%  # rearrange by correlations
   shave()  # Shave off the upper triangle for a cleaner result

Steps 3 and 4 are entirely optional and are just included to demonstrate the usefulness of the package.
