I have a data frame with 10 columns that records the actions of "users". One of the columns (column 10) contains a user ID, which identifies the user but is not unique per row. The data frame has about 750,000 rows. I am trying to split it by the user ID column into individual data frames (that is, a list or vector of data frames), so that each one isolates the actions of a single actor.
ID | Data1 | Data2 | ... | UserID
1 | aaa | bbb | ... | u_001
2 | aab | bb2 | ... | u_001
3 | aac | bb3 | ... | u_001
4 | aad | bb4 | ... | u_002
resulting in
list(
ID | Data1 | Data2 | ... | UserID
1 | aaa | bbb | ... | u_001
2 | aab | bb2 | ... | u_001
3 | aac | bb3 | ... | u_001
,
4 | aad | bb4 | ... | u_002
...)
The following works very well for me on a small sample (1000 rows):
paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)
and then accessing the element I want by paths[1] for instance.
When applied to the original large data frame, or even to a matrix representation of it, this chokes my machine (4 GB RAM, Mac OS X 10.6, R 2.15) and never completes. (I know that a newer R version exists, but I believe this is not the main problem.)
It seems that split performs better and does complete after a long time, but with my limited R knowledge I do not know how to piece the resulting list of vectors back into a vector of matrices.
path = split(smallsampleMat, smallsampleMat[,10])
I have also considered using big.matrix etc., but without much success in speeding up the process.
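The difference described above between splitting a data frame and splitting a matrix can be sketched as follows. This is only a minimal illustration on made-up data: the object names (actions, by_user, mat_by_user) and the toy rows are assumptions and do not come from the original data. On a data frame, split() returns a list of data frames directly. On a matrix it splits the underlying vector, so one option is to split the row indices instead and subset the matrix with them.

```r
# Hypothetical toy data standing in for the real 750,000-row table
actions <- data.frame(
  ID     = 1:4,
  Data1  = c("aaa", "aab", "aac", "aad"),
  UserID = c("u_001", "u_001", "u_001", "u_002"),
  stringsAsFactors = FALSE
)

# On a data frame, split() yields a named list with one data frame per user
by_user <- split(actions, actions$UserID)

# On a matrix, split the row indices by user and subset with them;
# drop = FALSE keeps single-row results as matrices
mat <- as.matrix(actions)
idx <- split(seq_len(nrow(mat)), mat[, "UserID"])
mat_by_user <- lapply(idx, function(i) mat[i, , drop = FALSE])
```

Elements can then be accessed by name, for example by_user[["u_001"]]. Double brackets return the data frame itself, whereas single brackets return a list of length one.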
From version 0.8.0, dplyr offers a handy function called group_split():
# On sample data from @Aus_10
df %>%
group_split(g)
[[1]]
# A tibble: 25 x 3
ran_data1 ran_data2 g
<dbl> <dbl> <fct>
1 2.04 0.627 A
2 0.530 -0.703 A
3 -0.475 0.541 A
4 1.20 -0.565 A
5 -0.380 -0.126 A
6 1.25 -1.69 A
7 -0.153 -1.02 A
8 1.52 -0.520 A
9 0.905 -0.976 A
10 0.517 -0.535 A
# … with 15 more rows
[[2]]
# A tibble: 25 x 3
ran_data1 ran_data2 g
<dbl> <dbl> <fct>
1 1.61 0.858 B
2 1.05 -1.25 B
3 -0.440 -0.506 B
4 -1.17 1.81 B
5 1.47 -1.60 B
6 -0.682 -0.726 B
7 -2.21 0.282 B
8 -0.499 0.591 B
9 0.711 -1.21 B
10 0.705 0.960 B
# … with 15 more rows
To leave out the grouping column (keep is the argument name as of dplyr 0.8.0; later releases rename it to .keep):
df %>%
group_split(g, keep = FALSE)
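One thing to be aware of is that the list returned by group_split() is not named, unlike the result of base split(). The following is a small sketch, assuming dplyr is installed, of one way to attach the group labels using group_keys(). The toy data frame and the names toy, parts and keys are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with a grouping column g
toy <- data.frame(
  x = c(1, 2, 3, 4),
  g = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

grouped <- toy %>% group_by(g)

# One tibble per group, in the same order as the group keys
parts <- grouped %>% group_split()
keys  <- grouped %>% group_keys()

# Attach the group labels so elements can be looked up by name
names(parts) <- keys$g
```

After this, parts[["B"]] returns the rows for group B.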
I stumbled across this answer and actually wanted BOTH groups (the data for that one user and the data for everyone except that user). This is not necessary for the specifics of this post, but I thought I would add it in case someone is googling the same issue as me.
df <- data.frame(
ran_data1=rnorm(125),
ran_data2=rnorm(125),
g=rep(factor(LETTERS[1:5]), 25)
)
test_x = split(df,df$g)[['A']]
test_y = split(df,df$g!='A')[['TRUE']]
Here's what it looks like:
> head(test_x)
ran_data1 ran_data2 g
1 1.1362198 1.2969541 A
6 0.5510307 -0.2512449 A
11 0.0321679 0.2358821 A
16 0.4734277 -1.2889081 A
21 -1.2686151 0.2524744 A
> head(test_y)
ran_data1 ran_data2 g
2 -2.23477293 1.1514810 B
3 -0.46958938 -1.7434205 C
4 0.07365603 0.1111419 D
5 -1.08758355 0.4727281 E
7 0.28448637 -1.5124336 B
8 1.24117504 0.4928257 C
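The same two pieces can also be obtained with a single logical index instead of two split() calls. This is just an alternative sketch in base R, not the approach used above. The names toy, is_a, only_a and not_a are made up, and it assumes the grouping column contains no NA values (with NAs, the logical index would produce NA rows).

```r
# Hypothetical toy data: two rows for each of the groups A to E
set.seed(1)
toy <- data.frame(
  ran_data1 = rnorm(10),
  g = rep(c("A", "B", "C", "D", "E"), 2),
  stringsAsFactors = FALSE
)

# TRUE for rows belonging to group A, FALSE otherwise
is_a <- toy$g == "A"

only_a <- toy[is_a, ]   # rows for group A
not_a  <- toy[!is_a, ]  # rows for every other group
```

Because both subsets come from the same index, they are guaranteed to partition the original rows.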
Source: Stackoverflow.com