[list] data.frame rows to a list

I have a data.frame which I would like to convert to a list by rows, meaning each row would correspond to its own list elements. In other words, I would like a list that is as long as the data.frame has rows.

So far, I've tackled this problem in the following manner, but I was wondering if there's a better way to approach this.

xy.df <- data.frame(x = runif(10),  y = runif(10))

# pre-allocate a list and fill it with a loop
xy.list <- vector("list", nrow(xy.df))
for (i in 1:nrow(xy.df)) {
    xy.list[[i]] <- xy.df[i,]
}

This question is related to list r dataframe

The answer is


Seems a current version of the purrr (0.2.2) package is the fastest solution:

by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out

Let's compare the most interesting solutions:

data("Batting", package = "Lahman")
x <- Batting[1:10000, 1:10]
library(benchr)
library(purrr)
benchmark(
    split = split(x, seq_len(.row_names_info(x, 2L))),
    mapply = .mapply(function(...) structure(list(...), class = "data.frame", row.names = 1L), x, NULL),
    purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
)

Rsults:

Benchmark summary:
Time units : milliseconds 
  expr n.eval   min  lw.qu median   mean  up.qu  max  total relative
 split    100 983.0 1060.0 1130.0 1130.0 1180.0 1450 113000     34.3
mapply    100 826.0  894.0  963.0  972.0 1030.0 1320  97200     29.3
 purrr    100  24.1   28.6   32.9   44.9   40.5  183   4490      1.0

Also we can get the same result with Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List df2list(const DataFrame& x) {
    std::size_t nrows = x.rows();
    std::size_t ncols = x.cols();
    CharacterVector nms = x.names();
    List res(no_init(nrows));
    for (std::size_t i = 0; i < nrows; ++i) {
        List tmp(no_init(ncols));
        for (std::size_t j = 0; j < ncols; ++j) {
            switch(TYPEOF(x[j])) {
                case INTSXP: {
                    if (Rf_isFactor(x[j])) {
                        IntegerVector t = as<IntegerVector>(x[j]);
                        RObject t2 = wrap(t[i]);
                        t2.attr("class") = "factor";
                        t2.attr("levels") = t.attr("levels");
                        tmp[j] = t2;
                    } else {
                        tmp[j] = as<IntegerVector>(x[j])[i];
                    }
                    break;
                }
                case LGLSXP: {
                    tmp[j] = as<LogicalVector>(x[j])[i];
                    break;
                }
                case CPLXSXP: {
                    tmp[j] = as<ComplexVector>(x[j])[i];
                    break;
                }
                case REALSXP: {
                    tmp[j] = as<NumericVector>(x[j])[i];
                    break;
                }
                case STRSXP: {
                    tmp[j] = as<std::string>(as<CharacterVector>(x[j])[i]);
                    break;
                }
                default: stop("Unsupported type '%s'.", type2name(x));
            }
        }
        tmp.attr("class") = "data.frame";
        tmp.attr("row.names") = 1;
        tmp.attr("names") = nms;
        res[i] = tmp;
    }
    res.attr("names") = x.attr("row.names");
    return res;
}

Now caompare with purrr:

benchmark(
    purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out,
    rcpp = df2list(x)
)

Results:

Benchmark summary:
Time units : milliseconds 
 expr n.eval  min lw.qu median mean up.qu   max total relative
purrr    100 25.2  29.8   37.5 43.4  44.2 159.0  4340      1.1
 rcpp    100 19.0  27.9   34.3 35.8  37.2  93.8  3580      1.0

The best way for me was:

Example data:

Var1<-c("X1",X2","X3")
Var2<-c("X1",X2","X3")
Var3<-c("X1",X2","X3")

Data<-cbind(Var1,Var2,Var3)

ID    Var1   Var2  Var3 
1      X1     X2    X3
2      X4     X5    X6
3      X7     X8    X9

We call the BBmisc library

library(BBmisc)

data$lists<-convertRowsToList(data[,2:4])

And the result will be:

ID    Var1   Var2  Var3  lists
1      X1     X2    X3   list("X1", "X2", X3") 
2      X4     X5    X6   list("X4","X5", "X6") 
3      X7     X8    X9   list("X7,"X8,"X9) 

A couple of more options :

With asplit

asplit(xy.df, 1)
#[[1]]
#     x      y 
#0.1137 0.6936 

#[[2]]
#     x      y 
#0.6223 0.5450 

#[[3]]
#     x      y 
#0.6093 0.2827 
#....

With split and row

split(xy.df, row(xy.df)[, 1])

#$`1`
#       x      y
#1 0.1137 0.6936

#$`2`
#       x     y
#2 0.6223 0.545

#$`3`
#       x      y
#3 0.6093 0.2827
#....

data

set.seed(1234)
xy.df <- data.frame(x = runif(10),  y = runif(10))

If you want to completely abuse the data.frame (as I do) and like to keep the $ functionality, one way is to split you data.frame into one-line data.frames gathered in a list :

> df = data.frame(x=c('a','b','c'), y=3:1)
> df
  x y
1 a 3
2 b 2
3 c 1

# 'convert' into a list of data.frames
ldf = lapply(as.list(1:dim(df)[1]), function(x) df[x[1],])

> ldf
[[1]]
x y
1 a 3    
[[2]]
x y
2 b 2
[[3]]
x y
3 c 1

# and the 'coolest'
> ldf[[2]]$y
[1] 2

It is not only intellectual masturbation, but allows to 'transform' the data.frame into a list of its lines, keeping the $ indexation which can be useful for further use with lapply (assuming the function you pass to lapply uses this $ indexation)


Eureka!

xy.list <- as.list(as.data.frame(t(xy.df)))

An alternative way is to convert the df to a matrix then applying the list apply lappy function over it: ldf <- lapply(as.matrix(myDF), function(x)x)


The by_row function from the purrrlyr package will do this for you.

This example demonstrates

myfn <- function(row) {
  #row is a tibble with one row, and the same number of columns as the original df
  l <- as.list(row)
  return(l)
}

list_of_lists <- purrrlyr::by_row(df, myfn, .labels=FALSE)$.out

By default, the returned value from myfn is put into a new list column in the df called .out. The $.out at the end of the above statement immediately selects this column, returning a list of lists.


Like @flodel wrote: This converts your dataframe into a list that has the same number of elements as number of rows in dataframe:

NewList <- split(df, f = seq(nrow(df)))

You can additionaly add a function to select only those columns that are not NA in each element of the list:

NewList2 <- lapply(NewList, function(x) x[,!is.na(x)])

Another alternative using library(purrr) (that seems to be a bit quicker on large data.frames)

flatten(by_row(xy.df, ..f = function(x) flatten_chr(x), .labels = FALSE))

I was working on this today for a data.frame (really a data.table) with millions of observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.

Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat))) for that data set. Below, I benchmark the three methods on a 7500 row, 5 column data set (iris repeated 50 times).

library(data.table)
library(microbenchmark)

microbenchmark(
split={dat1 <- split(dat, seq_len(nrow(dat)))},
setDF={dat2 <- lapply(seq_len(nrow(dat)),
                  function(i) setDF(lapply(dat, "[", i)))},
attrDT={dat3 <- lapply(seq_len(nrow(dat)),
           function(i) {
             tmp <- lapply(dat, "[", i)
             attr(tmp, "class") <- c("data.table", "data.frame")
             setDF(tmp)
           })},
datList = {datL <- lapply(seq_len(nrow(dat)),
                          function(i) lapply(dat, "[", i))},
times=20
) 

This returns

Unit: milliseconds
       expr      min       lq     mean   median        uq       max neval
      split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150    20
      setDF 459.0577 466.3432 511.2656 482.1943  500.6958  750.6635    20
     attrDT 399.1999 409.6316 461.6454 422.5436  490.5620  717.6355    20
    datList 192.1175 201.9896 241.4726 208.4535  246.4299  411.2097    20

While the differences are not as large as in my previous test, the straight setDF method is significantly faster at all levels of the distribution of runs with max(setDF) < min(split) and the attr method is typically more than twice as fast.

A fourth method is the extreme champion, which is a simple nested lapply, returning a nested list. This method exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame function were roughly an order of magnitude slower than the data.table techniques.

data

dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))

A more modern solution uses only purrr::transpose:

library(purrr)
iris[1:2,] %>% purrr::transpose()
#> [[1]]
#> [[1]]$Sepal.Length
#> [1] 5.1
#> 
#> [[1]]$Sepal.Width
#> [1] 3.5
#> 
#> [[1]]$Petal.Length
#> [1] 1.4
#> 
#> [[1]]$Petal.Width
#> [1] 0.2
#> 
#> [[1]]$Species
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$Sepal.Length
#> [1] 4.9
#> 
#> [[2]]$Sepal.Width
#> [1] 3
#> 
#> [[2]]$Petal.Length
#> [1] 1.4
#> 
#> [[2]]$Petal.Width
#> [1] 0.2
#> 
#> [[2]]$Species
#> [1] 1

Examples related to list

Convert List to Pandas Dataframe Column Python find elements in one list that are not in the other Sorting a list with stream.sorted() in Java Python Loop: List Index Out of Range How to combine two lists in R How do I multiply each element in a list by a number? Save a list to a .txt file The most efficient way to remove first N elements in a list? TypeError: list indices must be integers or slices, not str Parse JSON String into List<string>

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe