[r] Calculate correlation with cor(), only for numerical columns

I have a dataframe and would like to calculate the correlation (with Spearman, data is categorical and ranked) but only for a subset of columns. I tried with all, but R's cor() function only accepts numerical data (x must be numeric, says the error message), even if Spearman is used.

One brute approach is to delete the non-numerical columns from the dataframe. This is not as elegant, for speed I still don't want to calculate correlations between all columns.

I hope there is a way to simply say "calculate correlations for columns x, y, z". Column references could by number or by name. I suppose the flexible way to provide them would be through a vector.

Any suggestions are appreciated.

This question is related to r correlation

The answer is


For numerical data you have the solution. But it is categorical data, you said. Then life gets a bit more complicated...

Well, first : The amount of association between two categorical variables is not measured with a Spearman rank correlation, but with a Chi-square test for example. Which is logic actually. Ranking means there is some order in your data. Now tell me which is larger, yellow or red? I know, sometimes R does perform a spearman rank correlation on categorical data. If I code yellow 1 and red 2, R would consider red larger than yellow.

So, forget about Spearman for categorical data. I'll demonstrate the chisq-test and how to choose columns using combn(). But you would benefit from a bit more time with Agresti's book : http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937

set.seed(1234)
X <- rep(c("A","B"),20)
Y <- sample(c("C","D"),40,replace=T)

table(X,Y)
chisq.test(table(X,Y),correct=F)
# I don't use Yates continuity correction

#Let's make a matrix with tons of columns

Data <- as.data.frame(
          matrix(
            sample(letters[1:3],2000,replace=T),
            ncol=25
          )
        )

# You want to select which columns to use
columns <- c(3,7,11,24)
vars <- names(Data)[columns]

# say you need to know which ones are associated with each other.
out <-  apply( combn(columns,2),2,function(x){
          chisq.test(table(Data[,x[1]],Data[,x[2]]),correct=F)$p.value
        })

out <- cbind(as.data.frame(t(combn(vars,2))),out)

Then you should get :

> out
   V1  V2       out
1  V3  V7 0.8116733
2  V3 V11 0.1096903
3  V3 V24 0.1653670
4  V7 V11 0.3629871
5  V7 V24 0.4947797
6 V11 V24 0.7259321

Where V1 and V2 indicate between which variables it goes, and "out" gives the p-value for association. Here all variables are independent. Which you would expect, as I created the data at random.


Another option would be to just use the excellent corrr package https://github.com/drsimonj/corrr and do

require(corrr)
require(dplyr)

myData %>% 
   select(x,y,z) %>%  # or do negative or range selections here
   correlate() %>%
   rearrange() %>%  # rearrange by correlations
   shave() # Shave off the upper triangle for a cleaner result

Steps 3 and 4 are entirely optional and are just included to demonstrate the usefulness of the package.


I found an easier way by looking at the R script generated by Rattle. It looks like below:

correlations <- cor(mydata[,c(1,3,5:87,89:90,94:98)], use="pairwise", method="spearman")