Split a large dataframe into a list of data frames based on common value in column

Question

I have a data frame with 10 columns collecting the actions of "users", where one of the columns (column 10) contains a non-unique ID identifying the user. The data frame has about 750,000 rows. I am trying to extract individual data frames (that is, a list or vector of data frames) split by the column containing the user identifier, to isolate the actions of a single actor.

    ID   Data1   Data2   UserID
    1    aaa     bbb     u_001
    2    aab     bb2     u_001
    3    aac     bb3     u_001
    4    aad     bb4     u_002

resulting in

    list(
    ID   Data1   Data2   UserID
    1    aaa     bbb     u_001
    2    aab     bb2     u_001
    3    aac     bb3     u_001
    ,
    4    aad     bb4     u_002
    ...)

The following works very well for me on a small sample (1000 rows):

    paths = by(smallsampleMat, smallsampleMat[, "userID"], function(x) x)

and then I access the element I want with paths[1], for instance.

When applied to the original large data frame, or even to a matrix representation, this chokes my machine (4GB RAM, MacOSX 10.6, R 2.15) and never completes. (I know that a newer R version exists, but I believe this is not the main problem.)

It seems that split is more performant and completes after a long time, but I do not know (inferior R knowledge) how to piece the resulting list of vectors into a vector of matrices.

    path = split(smallsampleMat, smallsampleMat[, 10])

I have also considered using big.matrix etc., but without much success in speeding up the process.

User · Answer

From version 0.8.0, dplyr offers a handy function called group_split():

    # On sample data from Aus_10
    df %>%
      group_split(g)

    [[1]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g
           <dbl>     <dbl> <fct>
     1     2.04      0.627 A
     2     0.530    -0.703 A
     3    -0.475     0.541 A
     4     1.20     -0.565 A
     5    -0.380    -0.126 A
     6     1.25     -1.69  A
     7    -0.153    -1.02  A
     8     1.52     -0.520 A
     9     0.905    -0.976 A
    10     0.517    -0.535 A
    # ... with 15 more rows

    [[2]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g
           <dbl>     <dbl> <fct>
     1     1.61      0.858 B
     2     1.05     -1.25  B
     3    -0.440    -0.506 B
     4    -1.17      1.81  B
     5     1.47     -1.60  B
     6    -0.682    -0.726 B
     7    -2.21      0.282 B
     8    -0.499     0.591 B
     9     0.711    -1.21  B
    10     0.705     0.960 B
    # ... with 15 more rows

To not include the grouping column:

    df %>%
      group_split(g, keep = FALSE)

In dplyr 1.0.0 and later this argument is spelled .keep.
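For readers who cannot install dplyr, here is a minimal base R sketch of the same idea as dropping the grouping column. It assumes the sample data frame with columns ran_data1, ran_data2 and g used elsewhere in this thread; the seed and the name pieces are illustrative choices, not part of the original answer.

```r
# Sample data shaped like the df used in this thread (seed chosen arbitrarily)
set.seed(42)
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# Split every column except g, using g as the grouping factor.
# The result is a list of data frames named after the levels of g.
pieces <- split(df[names(df) != "g"], df$g)

names(pieces)        # group labels
sapply(pieces, nrow) # number of rows per group
```

Unlike group_split(), which returns an unnamed list, base split() names each element after its group, so a single group can be retrieved with pieces[["A"]].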

User · Answer

You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each element. A matrix is an atomic vector with dimension attributes. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of different types and sizes, so it's very versatile, and you can use *apply functions to further operate on each element in the list. Example below.

    # For reproducible data
    set.seed(1)

    # Make some data
    userid <- rep(1:2, times = 4)
    data1 <- replicate(8, paste(sample(letters, 3), collapse = ""))
    data2 <- sample(10, 8)
    df <- data.frame(userid, data1, data2)

    # Split on userid
    out <- split(df, f = df$userid)
    # $`1`
    #   userid data1 data2
    # 1      1   gjn     3
    # 3      1   yqp     1
    # 5      1   rjs     6
    # 7      1   jtw     5
    #
    # $`2`
    #   userid data1 data2
    # 2      2   xfv     4
    # 4      2   bfe    10
    # 6      2   mrx     2
    # 8      2   fqd     9

Access each element using the [[ operator like this:

    out[[1]]
    #   userid data1 data2
    # 1      1   gjn     3
    # 3      1   yqp     1
    # 5      1   rjs     6
    # 7      1   jtw     5

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

    sapply(out, function(x) mean(x$data2))
    #    1    2
    # 3.75 6.25
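Because a matrix is an atomic vector with dimensions, calling split() directly on a matrix splits the underlying vector, which is why the question ended up with a list of vectors. If a list of matrices is really what is wanted, one possible approach is to split the row indices and subset the matrix with them. The sketch below uses a small made-up numeric matrix m as a stand-in for the question's data; the names m, idx and mats are hypothetical.

```r
# Hypothetical numeric matrix standing in for the question's matrix
m <- cbind(
  userID = c(1, 1, 1, 2),
  data1  = c(10, 20, 30, 40),
  data2  = c(5, 6, 7, 8)
)

# Split the row numbers by user rather than the matrix itself
idx <- split(seq_len(nrow(m)), m[, "userID"])

# Subset the matrix once per user; drop = FALSE keeps single-row
# results as matrices instead of collapsing them to vectors
mats <- lapply(idx, function(i) m[i, , drop = FALSE])

mats[["2"]]  # the one-row matrix for user 2
```

The result is still a list, as recommended above, but each element is now a matrix with the original columns.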

User · Answer

I stumbled across this answer and actually wanted BOTH groups (the data containing that one user and the data containing everything but that one user). This is not necessary for the specifics of this post, but I thought I would add it in case someone was googling the same issue as me.

    df <- data.frame(
      ran_data1 = rnorm(125),
      ran_data2 = rnorm(125),
      g = rep(factor(LETTERS[1:5]), 25)
    )

    test_x = split(df, df$g)[['A']]
    test_y = split(df, df$g != 'A')[['TRUE']]

Here's what it looks like:

    > head(test_x)
        ran_data1  ran_data2 g
    1   1.1362198  1.2969541 A
    6   0.5510307 -0.2512449 A
    11  0.0321679  0.2358821 A
    16  0.4734277 -1.2889081 A
    21 -1.2686151  0.2524744 A

    > head(test_y)
        ran_data1  ran_data2 g
    2 -2.23477293  1.1514810 B
    3 -0.46958938 -1.7434205 C
    4  0.07365603  0.1111419 D
    5 -1.08758355  0.4727281 E
    7  0.28448637 -1.5124336 B
    8  1.24117504  0.4928257 C
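As a variation on the above, both groups can probably be obtained from a single split() call by splitting on a logical condition, which yields one element named "TRUE" and one named "FALSE". This is a sketch on the same kind of sample data; the seed and the names both, only_a and not_a are illustrative.

```r
# Sample data shaped like the df above (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# One split on a logical vector gives both halves at once
both <- split(df, df$g == "A")

only_a <- both[["TRUE"]]   # rows where g is "A"
not_a  <- both[["FALSE"]]  # every other row

nrow(only_a)
nrow(not_a)
```

This avoids walking the data twice, which may matter on a data frame as large as the one in the question.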
