[r] Split a large dataframe into a list of data frames based on common value in column

I have a data frame with 10 columns, collecting actions of "users", where one of the columns contains an ID (not unique, identifying user)(column 10). the length of the data frame is about 750000 rows. I am trying to extract individual data frames (so getting a list or vector of data frames) split by the column containing the "user" identifier, to isolate the actions of a single actor.

ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
4  | aad   | bb4   | ... | u_002

resulting into

list(
ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
,
4  | aad   | bb4   | ... | u_002
...)

The following works very well for me on a small sample (1000 rows):

paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)

and then accessing the element I want by paths[1] for instance.

When applying on the original large data frame or even a matrix representation, this chokes my machine ( 4GB RAM, MacOSX 10.6, R 2.15) and never completes (I know that a newer R version exists, but I believe this is not the main problem).

It seems that split is more performant and after a long time completes, but I do not know ( inferior R knowledge) how to piece the resulting list of vectors into a vector of matrices.

path = split(smallsampleMat, smallsampleMat[,10]) 

I have considered also using big.matrix etc, but without much success that would speed up the process.

This question is related to r performance matrix split dataframe

The answer is


From version 0.8.0, dplyr offers a handy function called group_split():

# On sample data from @Aus_10

df %>%
  group_split(g)

[[1]]
# A tibble: 25 x 3
   ran_data1 ran_data2 g    
       <dbl>     <dbl> <fct>
 1     2.04      0.627 A    
 2     0.530    -0.703 A    
 3    -0.475     0.541 A    
 4     1.20     -0.565 A    
 5    -0.380    -0.126 A    
 6     1.25     -1.69  A    
 7    -0.153    -1.02  A    
 8     1.52     -0.520 A    
 9     0.905    -0.976 A    
10     0.517    -0.535 A    
# … with 15 more rows

[[2]]
# A tibble: 25 x 3
   ran_data1 ran_data2 g    
       <dbl>     <dbl> <fct>
 1     1.61      0.858 B    
 2     1.05     -1.25  B    
 3    -0.440    -0.506 B    
 4    -1.17      1.81  B    
 5     1.47     -1.60  B    
 6    -0.682    -0.726 B    
 7    -2.21      0.282 B    
 8    -0.499     0.591 B    
 9     0.711    -1.21  B    
10     0.705     0.960 B    
# … with 15 more rows

To not include the grouping column:

df %>%
 group_split(g, keep = FALSE)

Stumbled across this answer and I actually wanted BOTH groups (data containing that one user and data containing everything but that one user). Not necessary for the specifics of this post, but I thought I would add in case someone was googling the same issue as me.

df <- data.frame(
     ran_data1=rnorm(125),
     ran_data2=rnorm(125),
     g=rep(factor(LETTERS[1:5]), 25)
 )

test_x = split(df,df$g)[['A']]
test_y = split(df,df$g!='A')[['TRUE']]

Here's what it looks like:

head(test_x)
            x          y g
1   1.1362198  1.2969541 A
6   0.5510307 -0.2512449 A
11  0.0321679  0.2358821 A
16  0.4734277 -1.2889081 A
21 -1.2686151  0.2524744 A

> head(test_y)
            x          y g
2 -2.23477293  1.1514810 B
3 -0.46958938 -1.7434205 C
4  0.07365603  0.1111419 D
5 -1.08758355  0.4727281 E
7  0.28448637 -1.5124336 B
8  1.24117504  0.4928257 C

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to matrix

How to get element-wise matrix multiplication (Hadamard product) in numpy? How can I plot a confusion matrix? Error: stray '\240' in program What does the error "arguments imply differing number of rows: x, y" mean? How to input matrix (2D list) in Python? Difference between numpy.array shape (R, 1) and (R,) Counting the number of non-NaN elements in a numpy ndarray in Python Inverse of a matrix using numpy How to create an empty matrix in R? numpy matrix vector multiplication

Examples related to split

Parameter "stratify" from method "train_test_split" (scikit Learn) Pandas split DataFrame by column value How to split large text file in windows? Attribute Error: 'list' object has no attribute 'split' Split function in oracle to comma separated values with automatic sequence How would I get everything before a : in a string Python Split String by delimiter position using oracle SQL JavaScript split String with white space Split a String into an array in Swift? Split pandas dataframe in two if it has more than 10 rows

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe