[r] Convert column classes in data.table

I have a problem using data.table: How do I convert column classes? Here is a simple example: With data.frame I don't have a problem converting it, with data.table I just don't know how:

df <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
#One way: http://stackoverflow.com/questions/2851015/r-convert-data-frame-columns-from-factors-to-characters
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
#Another way
df[, "value"] <- as.numeric(df[, "value"])

library(data.table)
dt <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
dt <- data.table(lapply(dt, as.character), stringsAsFactors=FALSE) 
#Error in rep("", ncol(xi)) : invalid 'times' argument
#Produces error, does data.table not have the option stringsAsFactors?
dt[, "ID", with=FALSE] <- as.character(dt[, "ID", with=FALSE]) 
#Produces error: Error in `[<-.data.table`(`*tmp*`, , "ID", with = FALSE, value = "c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)") : 
#unused argument(s) (with = FALSE)

Do I miss something obvious here?

Update due to Matthew's post: I used an older version before, but even after updating to 1.6.6 (the version I use now) I still get an error.

Update 2: Let's say I want to convert every column of class "factor" to a "character" column, but don't know in advance which column is of which class. With a data.frame, I can do the following:

classes <- as.character(sapply(df, class))
colClasses <- which(classes=="factor")
df[, colClasses] <- sapply(df[, colClasses], as.character)

Can I do something similar with data.table?

Update 3:

sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.6.6

loaded via a namespace (and not attached):
[1] tools_2.13.1

This question is related to r data.table

The answer is


Try this

DT <- data.table(X1 = c("a", "b"), X2 = c(1,2), X3 = c("hello", "you"))
changeCols <- colnames(DT)[which(as.vector(DT[,lapply(.SD, class)]) == "character")]

DT[,(changeCols):= lapply(.SD, as.factor), .SDcols = changeCols]

If you have a list of column names in data.table, you want to change the class of do:

convert_to_character <- c("Quarter", "value")

dt[, convert_to_character] <- dt[, lapply(.SD, as.character), .SDcols = convert_to_character]

This is a BAD way to do it! I'm only leaving this answer in case it solves other weird problems. These better methods are the probably partly the result of newer data.table versions... so it's worth while to document this hard way. Plus, this is a nice syntax example for eval substitute syntax.

library(data.table)
dt <- data.table(ID = c(rep("A", 5), rep("B",5)), 
                 fac1 = c(1:5, 1:5), 
                 fac2 = c(1:5, 1:5) * 2, 
                 val1 = rnorm(10),
                 val2 = rnorm(10))

names_factors = c('fac1', 'fac2')
names_values = c('val1', 'val2')

for (col in names_factors){
  e = substitute(X := as.factor(X), list(X = as.symbol(col)))
  dt[ , eval(e)]
}
for (col in names_values){
  e = substitute(X := as.numeric(X), list(X = as.symbol(col)))
  dt[ , eval(e)]
}

str(dt)

which gives you

Classes ‘data.table’ and 'data.frame':  10 obs. of  5 variables:
 $ ID  : chr  "A" "A" "A" "A" ...
 $ fac1: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5
 $ fac2: Factor w/ 5 levels "2","4","6","8",..: 1 2 3 4 5 1 2 3 4 5
 $ val1: num  0.0459 2.0113 0.5186 -0.8348 -0.2185 ...
 $ val2: num  -0.0688 0.6544 0.267 -0.1322 -0.4893 ...
 - attr(*, ".internal.selfref")=<externalptr> 

I tried several approaches.

# BY {dplyr}
data.table(ID      = c(rep("A", 5), rep("B",5)), 
           Quarter = c(1:5, 1:5), 
           value   = rnorm(10)) -> df1
df1 %<>% dplyr::mutate(ID      = as.factor(ID),
                       Quarter = as.character(Quarter))
# check classes
dplyr::glimpse(df1)
# Observations: 10
# Variables: 3
# $ ID      (fctr) A, A, A, A, A, B, B, B, B, B
# $ Quarter (chr) "1", "2", "3", "4", "5", "1", "2", "3", "4", "5"
# $ value   (dbl) -0.07676732, 0.25376110, 2.47192852, 0.84929175, -0.13567312,  -0.94224435, 0.80213218, -0.89652819...

, or otherwise

# from list to data.table using data.table::setDT
list(ID      = as.factor(c(rep("A", 5), rep("B",5))), 
     Quarter = as.character(c(1:5, 1:5)), 
     value   = rnorm(10)) %>% setDT(list.df) -> df2
class(df2)
# [1] "data.table" "data.frame"

I provide a more general and safer way to do this stuff,

".." <- function (x) 
{
  stopifnot(inherits(x, "character"))
  stopifnot(length(x) == 1)
  get(x, parent.frame(4))
}


set_colclass <- function(x, class){
  stopifnot(all(class %in% c("integer", "numeric", "double","factor","character")))
  for(i in intersect(names(class), names(x))){
    f <- get(paste0("as.", class[i]))
    x[, (..("i")):=..("f")(get(..("i")))]
  }
  invisible(x)
}

The function .. makes sure we get a variable out of the scope of data.table; set_colclass will set the classes of your cols. You can use it like this:

dt <- data.table(i=1:3,f=3:1)
set_colclass(dt, c(i="character"))
class(dt$i)

Here is the same way as @Nera suggested to check the class first but instead of using .SD is to use the fast loop of data.table with set as @Matt Dowle solution with added class check.

for (j in seq_len(ncol(DT))){
  if(class(DT[[j]]) == 'factor')
    set(DT, j = j, value = as.character(DT[[j]]))
}

Raising Matt Dowle's comment to Geneorama's answer (https://stackoverflow.com/a/20808945/4241780) to make it more obvious (as encouraged), you can use for(...)set(...).


library(data.table)

DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7, c = letters[1:4])
DT1 <- copy(DT)
names_factors <- c("a", "c")

for(col in names_factors)
  set(DT, j = col, value = as.factor(DT[[col]]))

sapply(DT, class)
#>         a         b         c 
#>  "factor" "integer"  "factor"

Created on 2020-02-12 by the reprex package (v0.3.0)

See another of Matt's comments at https://stackoverflow.com/a/33000778/4241780 for more info.

Edit.

As noted by Espen and in help(set), j may be "Column name(s) (character) or number(s) (integer) to be assigned value when column(s) already exist". So names_factors <- c(1L, 3L) will also work.


try:

dt <- data.table(A = c(1:5), 
                 B= c(11:15))

x <- ncol(dt)

for(i in 1:x) 
{
     dt[[i]] <- as.character(dt[[i]])
}