[r] Coerce multiple columns to factors at once

I have a sample data frame like below:

data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))

I want to know how I can select multiple columns and convert them to factors in one step. I usually convert one column at a time with data$A = as.factor(data$A), but when the data frame is very large and has many columns, this becomes very time-consuming. Does anyone know of a better way to do it?

The answer is


Here is another tidyverse approach using the modify_at() function from the purrr package.

library(purrr)

# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))

# Modify specified columns to a factor class
data_with_factors <- data %>%
    purrr::modify_at(c("A", "C", "E"), factor)


# Check the results:
str(data_with_factors)
# 'data.frame':   4 obs. of  10 variables:
#  $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
#  $ B: int  25 32 2 19
#  $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
#  $ D: int  11 7 27 6
#  $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
#  $ F: int  21 23 39 18
#  $ G: int  31 14 38 26
#  $ H: int  17 24 34 10
#  $ I: int  13 28 30 29
#  $ J: int  3 22 37 9

If your goal is instead to work out which columns to convert from the table itself, you can try the following (data.table syntax, where bigm.train is the table):

### Pre-processing: collect the names of the character columns
ind <- names(which(sapply(bigm.train, is.character)))
### Convert those columns to factor, by reference
bigm.train[, (ind) := lapply(.SD, factor), .SDcols = ind]

This selects columns which are specifically character based and then converts them to factor.
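As a self-contained illustration of the same idea, here is a minimal sketch. The table dt and its column names are hypothetical stand-ins for bigm.train and are not part of the original answer.

```r
library(data.table)

# Hypothetical toy table standing in for bigm.train
dt <- data.table(
  item   = c("a", "b", "a", "c"),
  outlet = c("x", "x", "y", "y"),
  sales  = c(10.5, 3.2, 8.1, 4.4)
)

# Names of the columns that are currently character
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Convert only those columns, by reference; skip if there are none
if (length(chr_cols) > 0) {
  dt[, (chr_cols) := lapply(.SD, factor), .SDcols = chr_cols]
}

str(dt)
```

Using vapply() with logical(1) guarantees one TRUE/FALSE per column, so the name lookup cannot silently pick up the wrong columns.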


Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.

library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))

factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)

data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]

For completeness, and since a related question asks about converting string columns only, there is also mutate_if:

library(dplyr)

data <- cbind(stringVar = sample(c("foo", "bar"), 10, replace = TRUE),
              data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),
              stringsAsFactors = FALSE)

# Pass the function directly; funs() is deprecated in current dplyr
factoredData <- data %>% mutate_if(is.character, factor)

Using sapply on a data.frame to convert variables to factors at once does not work, because sapply simplifies its result to a matrix/array. My approach is to use lapply instead, as follows.

## let us create a data.frame here

class <- c("7", "6", "5", "3")

cash <- c(100, 200, 300, 150)

height <- c(170, 180, 150, 165)

people <- data.frame(class, cash, height)

class(people) ## This is a dataframe 

## We now apply lapply to the data.frame as follows.

bb <- data.frame(lapply(people, as.factor))

## The lapply part returns a list which we coerce back to a data.frame

class(bb) ## A data.frame

##Now let us check the classes of the variables 

class(bb$class)

class(bb$height)

class(bb$cash) ## As expected, all three are factors.
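The answer above converts every column. A minimal base R sketch of the same lapply idea, restricted to selected columns, might look like this. The choice of columns in cols is an assumption made for illustration.

```r
# Toy data frame, built as in the answer above
people <- data.frame(
  class  = c("7", "6", "5", "3"),
  cash   = c(100, 200, 300, 150),
  height = c(170, 180, 150, 165)
)

# sapply() simplifies its result, so the data frame structure is lost
m <- sapply(people, as.factor)
is.matrix(m)
is.data.frame(m)

# lapply() returns one list element per column; assigning back into the
# selected columns converts only those and leaves the rest untouched
cols <- c("class", "height")  # hypothetical choice of columns
people[cols] <- lapply(people[cols], as.factor)

str(people)
```

Because the assignment targets people[cols], the object stays a data.frame and no coercion back from a list is needed.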


You can use mutate_if (dplyr):

For example, to coerce integer columns to factors:

mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b", 
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

# A tibble: 10 x 3
       a     b c    
   <int> <int> <chr>
 1     1     1 a    
 2     2     2 a    
 3     3     3 b    
 4     4     4 b    
 5     5     5 c    
 6     6     6 c    
 7     7     7 c    
 8     8     8 c    
 9     9     9 c    
10    10    10 c   

Use the function:

library(dplyr)

mydata%>%
    mutate_if(is.integer,as.factor)

# A tibble: 10 x 3
       a     b c    
   <fct> <fct> <chr>
 1     1     1 a    
 2     2     2 a    
 3     3     3 b    
 4     4     4 b    
 5     5     5 c    
 6     6     6 c    
 7     7     7 c    
 8     8     8 c    
 9     9     9 c    
10    10    10 c    

Here is an option using dplyr. The %<>% operator from magrittr updates the lhs object with the resulting value. Note that mutate_each_() has since been deprecated in dplyr.

library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")

data %<>%
       mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame':  4 obs. of  10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int  15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int  14 4 22 20
# $ F: int  7 19 36 27
# $ G: int  35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int  17 1 9 25
# $ J: int  12 30 8 33

Or if we are using data.table, either use a for loop with set

setDT(data)
for(j in cols){
  set(data, i=NULL, j=j, value=factor(data[[j]]))
}

Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'

setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]

The more recent tidyverse way is to use the mutate_at function:

library(tidyverse)
library(magrittr)
set.seed(88)

data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")

data %<>% mutate_at(cols, factor)
str(data)
# $ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
# $ B: int  36 35 2 26
# $ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
# $ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
# $ E: int  3 14 30 38
# $ F: int  27 15 28 37
# $ G: int  19 11 6 21
# $ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
# $ I: int  23 24 13 8
# $ J: int  10 25 4 33
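
In dplyr 1.0.0 and later, the scoped verbs (mutate_at, mutate_if and friends) are superseded by across(). The following is a minimal sketch of the equivalent calls, assuming a recent dplyr. The names by_name and by_type are introduced here for illustration only.

```r
library(dplyr)

data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")

# Convert columns chosen by name (replaces mutate_at)
by_name <- data %>% mutate(across(all_of(cols), factor))

# Convert columns chosen by type, here every integer column (replaces mutate_if)
by_type <- data %>% mutate(across(where(is.integer), factor))

str(by_name)
```

all_of() makes the call fail loudly if a name in cols is missing from the data, which is usually preferable to silently skipping it.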
