I want to repeat the rows of a data.frame, each N times. The result should be a new data.frame (with nrow(new.df) == nrow(old.df) * N) that keeps the data types of the columns.
Example for N = 2:
                    A B   C
    A B   C       1 j i 100
  1 j i 100  -->  2 j i 100
  2 K P 101       3 K P 101
                  4 K P 101
So, each row is repeated 2 times and characters remain characters, factors remain factors, numerics remain numerics, ...
My first attempt used apply: apply(old.df, 2, function(co) rep(co, each = N)), but this transforms my values to characters and I get:
A B C
[1,] "j" "i" "100"
[2,] "j" "i" "100"
[3,] "K" "P" "101"
[4,] "K" "P" "101"
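The character output comes from apply() converting the data frame to a matrix before it does anything else, and a matrix can hold only one type. A minimal sketch of that effect, assuming old.df looks roughly like the example in the question (the column types here are my guess, not taken from the original data):

```r
# Assumed reconstruction of the example data; B is stored as a factor.
old.df <- data.frame(A = c("j", "K"),
                     B = factor(c("i", "P")),
                     C = c(100, 101),
                     stringsAsFactors = FALSE)
N <- 2

# apply() coerces its input with as.matrix(), so all cells share one type.
m <- apply(old.df, 2, function(co) rep(co, each = N))
is.matrix(m)   # the result is a matrix, not a data.frame
typeof(m)      # one storage type for every cell

# The original columns had three different classes.
col_classes <- sapply(old.df, class)
col_classes
```

So the problem is the coercion step, not rep() itself.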
Try using, for example,
N=2
rep(1:4, each = N)
as an index
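To spell the index idea out, here is a small sketch. The data frame is an assumed stand-in for the one in the question, and seq_len(nrow(old.df)) replaces the hard-coded 1:4 so that it adapts to the number of rows:

```r
# Assumed stand-in for the question's data.
old.df <- data.frame(A = c("j", "K"),
                     B = factor(c("i", "P")),
                     C = c(100, 101),
                     stringsAsFactors = FALSE)
N <- 2

# Build a row index in which every row number appears N times in a row.
idx <- rep(seq_len(nrow(old.df)), each = N)

# Row subsetting keeps each column's class.
new.df <- old.df[idx, ]

# Optional: replace the generated row names (1, 1.1, 2, 2.1) with 1..n.
rownames(new.df) <- NULL
```

Because this is ordinary row subsetting, characters stay characters, factors stay factors and numerics stay numerics.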
A clean dplyr solution, taken from here:
library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))
df %>% slice(rep(1:n(), each = 2))
Another way to do this would be to first get row indices, append extra copies of the df, and then order by the indices:
df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]
Although the other solutions may be shorter, this method may be more advantageous in certain situations.
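The snippet above doubles the rows. A possible generalisation to N copies, as a sketch only (the use of replicate() and do.call() is my own addition, and df here is a small made-up example):

```r
# Made-up example data.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
N <- 3

# Tag each row with its original position.
df$index <- seq_len(nrow(df))

# Stack N copies of the data frame on top of each other.
stacked <- do.call(rbind, replicate(N, df, simplify = FALSE))

# Sort by original position, then drop the helper column.
out <- stacked[order(stacked$index), names(stacked) != "index"]
rownames(out) <- NULL
out
```

Having the index column around before it is dropped can be handy if you need to know which original row each copy came from.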
For reference, and adding to the answers citing mefa: it might be worth taking a look at the implementation of mefa::rep.data.frame() in case you don't want to include the whole package:
> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f
Adding to what @dardisco mentioned about mefa::rep.data.frame(), it's very flexible.
You can either repeat each row N times:
rep(df, each=N)
or repeat the entire dataframe N times (think: like when you recycle a vectorized argument)
rep(df, times=N)
Two thumbs up for mefa! I had never heard of it until now and I had to write manual code to do this.
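If you would rather not load mefa, the same each/times distinction can be sketched in base R with the lapply() idiom shown in an earlier answer. The data frame below is made up for illustration:

```r
# Made-up example data.
df <- data.frame(a = letters[1:3], b = 4:6, stringsAsFactors = FALSE)
N <- 2

# each = N: every row is repeated N times before moving to the next row.
by_row <- as.data.frame(lapply(df, rep, each = N), stringsAsFactors = FALSE)

# times = N: the whole data frame is repeated N times, block after block.
whole <- as.data.frame(lapply(df, rep, times = N), stringsAsFactors = FALSE)

by_row$a   # rows grouped: a a b b c c
whole$a    # rows cycled:  a b c a b c
```

The question asks for the first form (each), so keep an eye on which argument name you pass.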
If you can repeat the whole thing, or subset it first then repeat that, then this similar question may be helpful. Once again:
library(mefa)
rep(mtcars,10)
or simply
mefa:::rep.data.frame(mtcars, 10)
The rep.row function seems to sometimes make lists for columns, which leads to bad memory hijinks. I have written the following which seems to work well:
library(plyr)
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}
My solution is similar to mefa:::rep.data.frame, but a little faster, and it takes care of row names:
rep.data.frame <- function(x, times) {
  rnames <- attr(x, "row.names")
  x <- lapply(x, rep.int, times = times)
  class(x) <- "data.frame"
  if (!is.numeric(rnames))
    attr(x, "row.names") <- make.unique(rep.int(rnames, times))
  else
    attr(x, "row.names") <- .set_row_names(length(rnames) * times)
  x
}
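To see what the row-name handling does, here is a small usage sketch. The function body is copied from above but given the hypothetical name rep_df_times so that it does not register as a rep() method, and the input data is made up. Note that it repeats the whole data frame (times), not each row in place:

```r
# Copy of the function above under a hypothetical name.
rep_df_times <- function(x, times) {
  rnames <- attr(x, "row.names")
  x <- lapply(x, rep.int, times = times)
  class(x) <- "data.frame"
  if (!is.numeric(rnames))
    attr(x, "row.names") <- make.unique(rep.int(rnames, times))
  else
    attr(x, "row.names") <- .set_row_names(length(rnames) * times)
  x
}

# Character row names are repeated and then made unique.
named <- data.frame(v = 1:2, row.names = c("r1", "r2"))
out_named <- rep_df_times(named, 2)
rownames(out_named)

# Automatic (numeric) row names are simply renumbered.
plain <- data.frame(v = 1:2)
out_plain <- rep_df_times(plain, 2)
rownames(out_plain)
```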
Compare solutions:
library(Lahman)
library(microbenchmark)
microbenchmark(
  mefa:::rep.data.frame(Batting, 10),
  rep.data.frame(Batting, 10),
  Batting[rep.int(seq_len(nrow(Batting)), 10), ],
  times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749 278.1066 356.3210 10 a
#> rep.data.frame(Batting, 10) 79.70335 82.8165 134.0974 87.2587 191.1713 307.4567 10 a
#> Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927 10 b
There is a lovely vectorized solution that repeats each row its own number of times, for example by adding an ntimes column to your data frame:
A B C ntimes
1 j i 100 2
2 K P 101 4
3 Z Z 102 1
Method:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2,4,1))
df <- as.data.frame(lapply(df, rep, df$ntimes))
Result:
  A B   C ntimes
1 j i 100      2
2 j i 100      2
3 K P 101      4
4 K P 101      4
5 K P 101      4
6 K P 101      4
7 Z Z 102      1
This is very similar to Josh O'Brien and Mark Miller's method:
df[rep(seq_len(nrow(df)), df$ntimes),]
However, that method appears quite a bit slower:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2000,3000,4000))
microbenchmark::microbenchmark(
  df[rep(seq_len(nrow(df)), df$ntimes), ],
  as.data.frame(lapply(df, rep, df$ntimes)),
  times = 10
)
Result:
Unit: microseconds
expr min lq mean median uq max neval
df[rep(seq_len(nrow(df)), df$ntimes), ] 3563.113 3586.873 3683.7790 3613.702 3657.063 4326.757 10
as.data.frame(lapply(df, rep, df$ntimes)) 625.552 654.638 676.4067 668.094 681.929 799.893 10
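Before choosing on speed alone, it may be worth checking that the two approaches really return the same data. A quick sketch using the small ntimes example from this answer (stringsAsFactors = FALSE is my addition so that the result does not depend on the R version):

```r
df <- data.frame(A = c("j", "K", "Z"),
                 B = c("i", "P", "Z"),
                 C = c(100, 101, 102),
                 ntimes = c(2, 4, 1),
                 stringsAsFactors = FALSE)

# Column-wise: repeat each column's elements according to ntimes.
by_column <- as.data.frame(lapply(df, rep, df$ntimes),
                           stringsAsFactors = FALSE)

# Index-wise: build a row index and subset.
by_index <- df[rep(seq_len(nrow(df)), df$ntimes), ]
rownames(by_index) <- NULL   # drop the 1, 1.1, 2, 2.1, ... row names

# After resetting row names, the two results should match.
same <- isTRUE(all.equal(by_column, by_index))
same
```

The only visible difference I would expect is the row names, which the index-based version derives from the original rows.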
Source: Stackoverflow.com