[r] Sample random rows in dataframe

I am struggling to find the appropriate function that would return a specified number of rows picked up randomly without replacement from a data frame in R language? Can anyone help me out?

This question is related to r dataframe random r-faq

The answer is


EDIT: This answer is now outdated, see the updated version.

In my R package I have enhanced sample so that it now behaves as expected also for data frames:

library(devtools); install_github('kimisc', 'krlmlr')

library(kimisc)
example(sample.data.frame)

smpl..> set.seed(42)

smpl..> sample(data.frame(a=c(1,2,3), b=c(4,5,6),
                           row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

This is achieved by making sample an S3 generic method and providing the necessary (trivial) functionality in a function. A call to setMethod fixes everything. The original implementation still can be accessed through base::sample.


Write one! Wrapping JC's answer gives me:

randomRows = function(df,n){
   return(df[sample(nrow(df),n),])
}

Now make it better by checking first if n<=nrow(df) and stopping with an error.


I'm new in R, but I was using this easy method that works for me:

sample_of_diamonds <- diamonds[sample(nrow(diamonds),100),]

PS: Feel free to note if it has some drawback I'm not thinking about.


You could do this:

library(dplyr)

cols <- paste0("a", 1:10)
tab <- matrix(1:1000, nrow = 100) %>% as.tibble() %>% set_names(cols)
tab
# A tibble: 100 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     1   101   201   301   401   501   601   701   801   901
 2     2   102   202   302   402   502   602   702   802   902
 3     3   103   203   303   403   503   603   703   803   903
 4     4   104   204   304   404   504   604   704   804   904
 5     5   105   205   305   405   505   605   705   805   905
 6     6   106   206   306   406   506   606   706   806   906
 7     7   107   207   307   407   507   607   707   807   907
 8     8   108   208   308   408   508   608   708   808   908
 9     9   109   209   309   409   509   609   709   809   909
10    10   110   210   310   410   510   610   710   810   910
# ... with 90 more rows

Above I just made a dataframe with 10 columns and 100 rows, ok?

Now you can sample it with sample_n:

sample_n(tab, size = 800, replace = T)
# A tibble: 800 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1    53   153   253   353   453   553   653   753   853   953
 2    14   114   214   314   414   514   614   714   814   914
 3    10   110   210   310   410   510   610   710   810   910
 4    70   170   270   370   470   570   670   770   870   970
 5    36   136   236   336   436   536   636   736   836   936
 6    77   177   277   377   477   577   677   777   877   977
 7    13   113   213   313   413   513   613   713   813   913
 8    58   158   258   358   458   558   658   758   858   958
 9    29   129   229   329   429   529   629   729   829   929
10     3   103   203   303   403   503   603   703   803   903
# ... with 790 more rows

Outdated answer. Please use dplyr::sample_frac() or dplyr::sample_n() instead.

In my R package there is a function sample.rows just for this purpose:

install.packages('kimisc')

library(kimisc)
example(sample.rows)

smpl..> set.seed(42)

smpl..> sample.rows(data.frame(a=c(1,2,3), b=c(4,5,6),
                               row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

Enhancing sample by making it a generic S3 function was a bad idea, according to comments by Joris Meys to a previous answer.


Just for completeness sake:

dplyr also offers to draw a proportion or fraction of the sample by

df %>% sample_frac(0.33)

This is very convenient e.g. in machine learning when you have to do a certain split ratio like 80%:20%


The data.table package provides the function DT[sample(.N, M)], sampling M random rows from the data table DT.

library(data.table)
set.seed(10)

mtcars <- data.table(mtcars)
mtcars[sample(.N, 6)]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
2: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
3: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
5: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
6: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2

The answer John Colby gives is the right answer. However if you are a dplyr user there is also the answer sample_n:

sample_n(df, 10)

randomly samples 10 rows from the dataframe. It calls sample.int, so really is the same answer with less typing (and simplifies use in the context of magrittr since the dataframe is the first argument).


You could do this:

sample_data = data[sample(nrow(data), sample_size, replace = FALSE), ]

Select a Random sample from a tibble type in R:

library("tibble")    
a <- your_tibble[sample(1:nrow(your_tibble), 150),]

nrow takes a tibble and returns the number of rows. The first parameter passed to sample is a range from 1 to the end of your tibble. The second parameter passed to sample, 150, is how many random samplings you want. The square bracket slicing specifies the rows of the indices returned. Variable 'a' gets the value of the random sampling.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to random

How can I get a random number in Kotlin? scikit-learn random state in splitting dataset Random number between 0 and 1 in python In python, what is the difference between random.uniform() and random.random()? Generate random colors (RGB) Random state (Pseudo-random number) in Scikit learn How does one generate a random number in Apple's Swift language? How to generate a random string of a fixed length in Go? Generate 'n' unique random numbers within a range What does random.sample() method in python do?

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc