How to split data into training/testing sets using sample function

Question

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

    sample(x, size, replace = FALSE, prob = NULL)

I have a dataset that I need to put into a training (75%) and testing (25%) set. I'm not sure what information I'm supposed to put into x and size. Is x the dataset file, and size how many samples I have?

User · Answer

Below is a function that creates a list of sub-samples of the same size. This is not exactly what you wanted, but it might prove useful for others. In my case, I used it to create multiple classification trees on smaller samples to test for overfitting:

    df_split <- function(df, number) {
      sizedf <- length(df[, 1])   # number of rows
      bound  <- sizedf / number   # rows per sub-sample
      list   <- list()
      for (i in 1:number) {
        list[i] <- list(df[((i * bound + 1) - bound):(i * bound), ])
      }
      return(list)
    }

Example:

    x <- matrix(c(1:10), ncol = 1)
    x
    #       [,1]
    #  [1,]    1
    #  [2,]    2
    #  [3,]    3
    #  [4,]    4
    #  [5,]    5
    #  [6,]    6
    #  [7,]    7
    #  [8,]    8
    #  [9,]    9
    # [10,]   10

    x_split <- df_split(x, 5)
    x_split
    # [[1]]
    # [1] 1 2
    #
    # [[2]]
    # [1] 3 4
    #
    # [[3]]
    # [1] 5 6
    #
    # [[4]]
    # [1] 7 8
    #
    # [[5]]
    # [1]  9 10

User · Answer

I can suggest using the rsample package:

    library(rsample)

    # choosing 75% of the data to be the training data
    data_split <- initial_split(data, prop = .75)

    # extracting training data and test data as two separate data frames
    data_train <- training(data_split)
    data_test  <- testing(data_split)

User · Answer

    require(caTools)
    set.seed(101)   # This is used to create the same samples every time
    split1 = sample.split(data$anycol, SplitRatio = 2/3)

    train = subset(data, split1 == TRUE)
    test  = subset(data, split1 == FALSE)

The sample.split() function returns a logical vector (split1 here) with one entry per row; 2/3 of the entries are TRUE and the others are FALSE. The rows where split1 is TRUE are copied into train, and the other rows are copied into test.

User · Answer

This is almost the same code, but with a nicer look:

    bound <- floor((nrow(df) / 4) * 3)     # define % of training and test set

    df <- df[sample(nrow(df)), ]           # sample rows
    df.train <- df[1:bound, ]              # get training set
    df.test <- df[(bound + 1):nrow(df), ]  # get test set

User · Answer

Assuming df is your data frame, and that you want to create 75% train and 25% test:

    all <- 1:nrow(df)
    train_i <- sort(sample(all, round(nrow(df) * 0.75, digits = 0), replace = FALSE))
    test_i <- all[-train_i]

Then, to create the train and test data frames:

    df_train <- df[train_i, ]
    df_test <- df[test_i, ]

User · Answer

I think this would solve the problem:

    df = data.frame(read.csv("data.csv"))

    # Split the dataset into 80-20
    numberOfRows = nrow(df)
    bound = as.integer(numberOfRows * 0.8)

    # Note: the ", 2" keeps only the second column of each part
    train = df[1:bound, 2]
    test1 = df[(bound + 1):numberOfRows, 2]

User · Answer

Use base R. The function runif generates uniformly distributed values from 0 to 1. By varying the cutoff value (train.size in the example below), you will always have approximately the same percentage of random records below the cutoff value.

    data(mtcars)
    set.seed(123)

    # desired proportion of records in training set
    train.size <- .7

    # TRUE/FALSE vector of values below/above the cutoff
    train.ind <- runif(nrow(mtcars)) < train.size

    # train
    train.df <- mtcars[train.ind, ]

    # test
    test.df <- mtcars[!train.ind, ]
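To illustrate what "approximately" means here, the following sketch compares the runif cutoff with a fixed-size draw from sample. The names train_size, train_ind and train_rows are chosen for this illustration only; it is a rough comparison, not part of the answer's method.

```r
data(mtcars)
set.seed(123)

train_size <- 0.7
n <- nrow(mtcars)

# runif approach: each row is assigned independently, so the share of
# rows that lands in the training set varies around train_size
train_ind <- runif(n) < train_size
sum(train_ind) / n

# sample approach: the training set always has floor(train_size * n) rows
train_rows <- sample(seq_len(n), size = floor(train_size * n))
length(train_rows) / n
```

With a small data set such as mtcars the runif share can drift noticeably from 0.7; with many rows the two approaches give nearly the same proportion.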

User · Answer

Beware of sample for splitting if you are looking for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you just dropped one observation, say 4, sampling by location would yield a different result because now 5 to 10 have all moved places.

An alternative method is to use a hash function to map IDs into some pseudo-random numbers and then sample on the mod of these numbers. This sample is more stable because assignment is now determined by the hash of each observation, and not by its relative position.

For example:

    require(openssl)     # for md5
    require(data.table)  # for the demo data

    set.seed(1)  # this won't help `sample`

    population <- as.character(1e5:(1e6 - 1))  # some made-up ID names

    N <- 1e4  # sample size

    sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
    sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

    # samples are all but identical
    sample1
    sample2
    nrow(merge(sample1, sample2))
    # [1] 9999

    # row splitting yields very different test sets, even though we've set the seed
    test <- sample(N - 1, N / 2, replace = F)

    test1 <- sample1[test, .(id)]
    test2 <- sample2[test, .(id)]
    nrow(test1)
    # [1] 5000

    nrow(merge(test1, test2))
    # [1] 2653

    # to fix that, we can use some hash function and sample on one hex digit of the hash

    md5_bit_mod <- function(x, m = 2L) {
      # Inputs:
      #   x: a character vector of ids
      #   m: the modulo divisor (modify for split proportions other than 50:50)
      # Output: remainders from dividing the first digit of the md5 hash of x by m
      as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
    }

    # hash splitting preserves the similarity, because the assignment of test/train
    # is determined by the hash of each obs., and not by its relative location in
    # the data, which may change
    test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
    test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
    nrow(merge(test1a, test2a))
    # [1] 5057

    nrow(test1a)
    # [1] 5057

The sample size is not exactly 5000 because assignment is probabilistic, but it shouldn't be a problem in large samples thanks to the law of large numbers.

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
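The stability argument can also be seen without any packages. The sketch below uses toy_bucket, a hypothetical helper invented for this illustration: it sums the character codes of each id and takes the remainder mod m. It is a deterministic stand-in for a hash, not md5, and its buckets are not guaranteed to be well balanced.

```r
# toy_bucket is a made-up helper for illustration only (NOT md5):
# it maps each id to sum(character codes) mod m
toy_bucket <- function(ids, m = 2L) {
  vapply(ids,
         function(id) as.integer(sum(utf8ToInt(id)) %% m),
         integer(1),
         USE.NAMES = FALSE)
}

ids_full    <- as.character(1001:1020)  # some made-up ids
ids_dropped <- ids_full[-4]             # the same ids with one observation removed

# assignment depends only on the id itself, not on its position
test_full    <- ids_full[toy_bucket(ids_full) == 0L]
test_dropped <- ids_dropped[toy_bucket(ids_dropped) == 0L]

# ids that are in the reduced test set but not in the full one
setdiff(test_dropped, test_full)
# character(0)
```

Because each id is bucketed on its own value, removing one observation cannot move any other observation between train and test.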

User · Answer

If you type:

    ?sample

it will launch a help page that explains what the parameters of the sample function mean.

I am not an expert, but here is some code I have:

    data <- data.frame(matrix(rnorm(400), nrow = 100))

    # randomly assign each row to one of four equally sized groups
    splitdata <- split(data[1:nrow(data), ], sample(rep(1:4, as.integer(nrow(data) / 4))))

    test <- splitdata[[1]]
    # train uses the three remaining groups, so it does not overlap with test
    train <- rbind(splitdata[[2]], splitdata[[3]], splitdata[[4]])

This will give you 75% train and 25% test.
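To connect the help page back to the question: in sample(x, size, replace, prob), x is the set of values to draw from and size is how many of them to draw. For a row split, x is usually the row numbers rather than the dataset itself. A minimal sketch, assuming a made-up data frame my_data that stands in for your own data:

```r
# my_data is a hypothetical stand-in for your own data frame
my_data <- data.frame(a = rnorm(100), b = rnorm(100))

set.seed(42)  # optional: makes the draw repeatable

# x    = the row numbers to draw from
# size = how many rows go into the training set (75% here)
train_rows <- sample(x = seq_len(nrow(my_data)),
                     size = floor(0.75 * nrow(my_data)),
                     replace = FALSE)

train <- my_data[train_rows, ]   # the sampled rows
test  <- my_data[-train_rows, ]  # everything that was not sampled
```

With replace = FALSE no row number is drawn twice, so train and test cannot share rows.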

User · Answer

There are numerous approaches to achieve data partitioning. For a more complete approach, take a look at the createDataPartition function in the caret package.

Here is a simple example:

    data(mtcars)

    ## 75% of the sample size
    smp_size <- floor(0.75 * nrow(mtcars))

    ## set the seed to make your partition reproducible
    set.seed(123)
    train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

    train <- mtcars[train_ind, ]
    test <- mtcars[-train_ind, ]

User · Answer

My solution is basically the same as dickoa's but a little easier to interpret:

    data(mtcars)
    n = nrow(mtcars)
    trainIndex = sample(1:n, size = round(0.7 * n), replace = FALSE)
    train = mtcars[trainIndex, ]
    test = mtcars[-trainIndex, ]

User · Answer

My solution shuffles the rows, then takes the first 75% of the rows as train and the remaining 25% as test. Super simple!

    row_count <- nrow(orders_pivotted)
    shuffled_rows <- sample(row_count)
    train_count <- floor(row_count * 0.75)

    train <- orders_pivotted[head(shuffled_rows, train_count), ]
    # take all remaining rows, so no row is left out when row_count is not a multiple of 4
    test <- orders_pivotted[tail(shuffled_rows, row_count - train_count), ]

User · Answer

    # I will split 'a' into train (70%) and test (30%)
    # a = original data frame
    library(dplyr)
    train <- sample_frac(a, 0.7)
    sid <- as.numeric(rownames(train))  # because rownames() returns character
    test <- a[-sid, ]
    # done

User · Answer

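I bumped into this one; it can help too:

    # Sonar is assumed here to be the data set from the mlbench package
    set.seed(12)
    data = Sonar[sample(nrow(Sonar)), ]  # reshuffles the data
    bound = floor(0.7 * nrow(data))
    df_train = data[1:bound, ]
    df_test = data[(bound + 1):nrow(data), ]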

User · Answer

Use the caTools package in R. Sample code is as follows:

    library(caTools)
    split = sample.split(data$DependentcoloumnName, SplitRatio = 0.6)
    training_set = subset(data, split == TRUE)
    test_set = subset(data, split == FALSE)

User · Answer

After looking through all the different methods posted here, I didn't see anyone use TRUE/FALSE to select and unselect data. So I thought I would share a method using that technique.

    n = nrow(dataset)
    split = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))

    training = dataset[split, ]
    testing = dataset[!split, ]

Explanation

There are multiple ways of selecting data in R. Most commonly, people use positive/negative indices to select/unselect respectively. However, the same functionality can be achieved by using TRUE/FALSE to select/unselect.

Consider the following example:

    # let's explore ways to select every other element
    data = c(1, 2, 3, 4, 5)

    # using positive indices to select wanted elements
    data[c(1, 3, 5)]
    # [1] 1 3 5

    # using negative indices to remove unwanted elements
    data[c(-2, -4)]
    # [1] 1 3 5

    # using booleans to select wanted elements
    data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
    # [1] 1 3 5

    # R recycles the TRUE/FALSE vector if it is not the correct dimension
    data[c(TRUE, FALSE)]
    # [1] 1 3 5

User · Answer

I would use dplyr for this; it makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add one if your data doesn't already contain it.

    library(dplyr)

    mtcars$id <- 1:nrow(mtcars)
    train <- mtcars %>% dplyr::sample_frac(.75)
    test  <- dplyr::anti_join(mtcars, train, by = 'id')

User · Answer

We can divide the data in a particular ratio; here it is 80% in the train and 20% in the test dataset:

    ind <- sample(2, nrow(dataName), replace = T, prob = c(0.8, 0.2))
    train <- dataName[ind == 1, ]
    test <- dataName[ind == 2, ]

User · Answer

There is a very simple way to select a number of rows using the R index for rows and columns. This lets you CLEANLY split the data set given a number of rows, say the first 80% of your data.

In R all rows and columns are indexed, so DataSetName[1, 1] is the value in the first row and first column of "DataSetName". I can select rows using [x, ] and columns using [, x].

For example, if I have a data set conveniently named "data" with 100 rows, I can view the first 80 rows using

    View(data[1:80, ])

In the same way, I can select these rows and subset them using:

    train = data[1:80, ]
    test = data[81:100, ]

Now I have my data split into two parts, with no row appearing in both. Quick and easy.
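The same positional split can be written without hard-coding 80 and 100. This is a minimal sketch, assuming a made-up data frame my_data; the names n and cutoff are chosen for illustration only.

```r
# my_data is a hypothetical stand-in for your data set
my_data <- data.frame(id = 1:57, value = rnorm(57))

n      <- nrow(my_data)
cutoff <- floor(0.8 * n)   # index of the last training row

train <- my_data[seq_len(cutoff), ]      # rows 1 .. cutoff
test  <- my_data[seq(cutoff + 1, n), ]   # rows cutoff + 1 .. n (assumes cutoff < n)
```

Rows are taken in their stored order, so if the data is sorted in some meaningful way you may want to shuffle the rows first, as several other answers here do.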

User · Answer

The scorecard package has a useful function for that, where you can specify the ratio and seed:

    library(scorecard)

    dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test.

User · Answer

It can be easily done by:

    set.seed(101)  # Set seed so that the same sample can be reproduced in the future

    # Now select 75% of the data as a sample from the total 'n' rows of the data
    sample <- sample.int(n = nrow(data), size = floor(.75 * nrow(data)), replace = F)
    train <- data[sample, ]
    test  <- data[-sample, ]

By using the caTools package:

    require(caTools)
    set.seed(101)
    sample = sample.split(data$anycolumn, SplitRatio = .75)
    train = subset(data, sample == TRUE)
    test  = subset(data, sample == FALSE)

User · Answer

    library(caret)
    intrain <- createDataPartition(y = sub_train$classe, p = 0.7, list = FALSE)
    training <- m_train[intrain, ]
    testing <- m_train[-intrain, ]

User · Answer

Just a briefer and simpler way, using the awesome dplyr library:

    library(dplyr)
    set.seed(275)  # to get repeatable data

    data.train <- sample_frac(Default, 0.7)

    train_index <- as.numeric(rownames(data.train))
    data.test <- Default[-train_index, ]

User · Answer

    set.seed(123)
    # nrow() is used here because length() on a data frame counts columns, not rows
    llwork <- sample(1:nrow(mydata), round(0.75 * nrow(mydata), digits = 0))
    wmydata <- mydata[llwork, ]
    tmydata <- mydata[-llwork, ]
