How do I create test and train samples from one dataframe with pandas?

Question

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing. Thanks!

User · Answer

There are many great answers above, so I just want to add one more example for the case where you want to specify the exact number of samples for the train and test sets, using just the numpy library.

    import numpy as np

    # set the random seed for reproducibility
    np.random.seed(17)

    # e.g. number of samples for the training set is 1000
    n_train = 1000

    # shuffle the indexes
    shuffled_indexes = np.arange(len(data_df))
    np.random.shuffle(shuffled_indexes)

    # use 'n_train' samples for training and the rest for testing
    train_ids = shuffled_indexes[:n_train]
    test_ids = shuffled_indexes[n_train:]

    train_data = data_df.iloc[train_ids]
    train_labels = labels_df.iloc[train_ids]

    test_data = data_df.iloc[test_ids]
    test_labels = labels_df.iloc[test_ids]

User · Answer

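You can convert the DataFrame to a NumPy array with df.to_numpy() (df.as_matrix() in pandas versions before 1.0) and pass that to train_test_split:

    # 'target' is a placeholder; pass the name of your label column
    Y = df.pop('target')
    X = df.to_numpy()
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    # model is any estimator you have already created
    model.fit(x_train, y_train)
    model.predict(x_test)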

User · Answer

A bit more elegant, to my taste, is to create a random column and then split by it. This way we get a split that suits our needs and is random.

    def split_df(df, p=[0.8, 0.2]):
        import numpy as np
        # assign each row to part 0 or part 1 with probabilities p
        df["rand"] = np.random.choice(len(p), len(df), p=p)
        r = [df[df["rand"] == val] for val in df["rand"].unique()]
        return r

User · Answer

I think you also need to get a copy, not a slice, of the dataframe if you want to add columns later.

    msk = np.random.rand(len(df)) < 0.8
    train, test = df[msk].copy(deep=True), df[~msk].copy(deep=True)

User · Answer

How about this? df is my dataframe.

    import math

    total_size = len(df)

    train_size = math.floor(0.66 * total_size)  # (2/3 part of my dataset)
    # training dataset
    train = df.head(train_size)
    # test dataset
    test = df.tail(len(df) - train_size)

User · Answer

There are many ways to create train, test, and even validation samples.

Case 1: the classic way, train_test_split without any options:

    from sklearn.model_selection import train_test_split
    train, test = train_test_split(df, test_size=0.3)

Case 2: a very small dataset (<500 rows), where you want results for all your rows using cross-validation. At the end, you will have one prediction for each row of your available training set.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    # random_state only takes effect (and is only accepted) together with shuffle=True
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    y_hat_all = []
    for train_index, test_index in kf.split(X, y):
        reg = RandomForestRegressor(n_estimators=50, random_state=0)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = reg.fit(X_train, y_train)
        y_hat = clf.predict(X_test)
        y_hat_all.append(y_hat)

Case 3a: unbalanced datasets for classification. Following case 1, here is the equivalent solution:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

Case 3b: unbalanced datasets for classification. Following case 2, here is the equivalent solution:

    from sklearn.model_selection import StratifiedKFold

    kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    y_hat_all = []
    for train_index, test_index in kf.split(X, y):
        reg = RandomForestRegressor(n_estimators=50, random_state=0)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = reg.fit(X_train, y_train)
        y_hat = clf.predict(X_test)
        y_hat_all.append(y_hat)

Case 4: you need to create train, test, and validation sets on big data to tune hyperparameters (60% train, 20% test, and 20% validation).

    from sklearn.model_selection import train_test_split

    # hold out 40% of the rows, then split that part in half
    X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
    X_test, X_val, y_test, y_val = train_test_split(
        X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)
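The fold logic in case 2 can also be sketched with numpy and pandas alone, which may help show what the cross-validation splitter is doing. This is a minimal illustration, not scikit-learn's implementation; the helper name kfold_indices, its parameters, and the toy DataFrame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def kfold_indices(n_rows, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) arrays of positional indices, one pair per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # array_split tolerates n_rows not being divisible by n_splits;
    # fold sizes then differ by at most one row
    folds = np.array_split(shuffled, n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Toy data, purely illustrative
df = pd.DataFrame({"x": np.arange(23), "y": np.arange(23) % 2})

seen = []         # index labels that have appeared in some test fold
fold_sizes = []   # (train rows, test rows) per fold
for train_idx, test_idx in kfold_indices(len(df), n_splits=5):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    fold_sizes.append((len(train), len(test)))
    seen.extend(test.index.tolist())
```

Every row ends up in exactly one test fold, which is why the loop in case 2 produces one prediction per row.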

User · Answer

scikit-learn's train_test_split is a good one. It splits both numpy arrays and dataframes.

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2)

User · Answer

You may also consider a stratified division into training and testing sets. A stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making a random stratified split into training and testing sets
        with proportions train_proportion and (1-train_proportion) of the initial sample.
        y is any iterable indicating the class of each observation in the sample.
        Initial proportions of classes inside the training and
        testing sets are preserved (stratified sampling).
        '''

        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))

            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True

        return train_inds, test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.

User · Answer

In my case, I wanted to split a data frame into train, test, and dev sets with a specific number of rows each. Here is my solution.

First, assign a unique id to the dataframe (if one does not already exist):

    import uuid
    df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split numbers:

    train = 120765
    test = 4134
    dev = 2816

The split function:

    def df_split(df, n):
        first = df.sample(n)
        second = df[~df.id.isin(list(first['id']))]
        first.reset_index(drop=True, inplace=True)
        second.reset_index(drop=True, inplace=True)
        return first, second

Now splitting into train, test, and dev:

    train, test = df_split(df, 120765)
    test, dev = df_split(test, 4134)

User · Answer

    import pandas as pd
    from sklearn.model_selection import train_test_split

    datafile_name = 'path_to_data_file'
    data = pd.read_csv(datafile_name)

    target_attribute = data['column_name']

    # test_size=0.8 holds out 80% of the rows for testing;
    # use test_size=0.2 for an 80/20 train/test split
    X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)

User · Answer

I would use scikit-learn's own train_test_split, and generate it from the index:

    from sklearn.model_selection import train_test_split

    y = df.pop('output')
    X = df

    X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
    # X_train holds index labels, so select with .loc
    X.loc[X_train]  # return dataframe train

User · Answer

I would just use numpy's randn:

    In [11]: df = pd.DataFrame(np.random.randn(100, 2))

    In [12]: msk = np.random.rand(len(df)) < 0.8

    In [13]: train = df[msk]

    In [14]: test = df[~msk]

And just to see this has worked:

    In [15]: len(test)
    Out[15]: 21

    In [16]: len(train)
    Out[16]: 79
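As a self-contained version of the session above, here is a small sketch showing that the boolean mask always partitions the frame, while the exact sizes depend on the random draw. The seeded generator and the column names are assumptions added for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded generator (assumption, for repeatable runs)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=["a", "b"])

# Each row lands in train independently with probability 0.8,
# so the split is roughly, not exactly, 80/20.
msk = rng.random(len(df)) < 0.8
train, test = df[msk], df[~msk]

print(len(train), len(test))
```

Because every row is assigned independently, the two parts never overlap and always add up to the full frame; only the ratio fluctuates around 80/20.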

User · Answer

No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y:

    X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col], test_size=0.2)

And if you want to split the whole df:

    X, y = df[list_of_x_cols], df[y_col]

User · Answer

To split into more than two sets, such as train, test, and validation, one can do:

    probs = np.random.rand(len(df))
    training_mask = probs < 0.7
    test_mask = (probs >= 0.7) & (probs < 0.85)
    validation_mask = probs >= 0.85

    df_training = df[training_mask]
    df_test = df[test_mask]
    df_validation = df[validation_mask]

This will put approximately 70% of the data in training, 15% in test, and 15% in validation.
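If exact sizes matter more than independent per-row assignment, one alternative is to shuffle once and slice by position. The sketch below is a swapped-in technique, not the mask approach from the answer; the function name three_way_split, its parameters, and the toy frame are hypothetical.

```python
import pandas as pd

def three_way_split(df, train_frac=0.7, test_frac=0.15, seed=0):
    """Shuffle once, then cut the shuffled frame at two positions."""
    shuffled = df.sample(frac=1, random_state=seed)  # frac=1 returns all rows in random order
    n_train = round(train_frac * len(df))
    n_test = round(test_frac * len(df))
    train = shuffled.iloc[:n_train]
    test = shuffled.iloc[n_train:n_train + n_test]
    validation = shuffled.iloc[n_train + n_test:]    # whatever remains
    return train, test, validation

# Toy data, purely illustrative
df = pd.DataFrame({"x": range(200)})
train, test, validation = three_way_split(df)
print(len(train), len(test), len(validation))
```

The validation set takes whatever rows remain, so the three parts always cover the frame even when the fractions do not divide the row count evenly.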

User · Answer

If you need to split your data with respect to the labels column in your data set, you can use this:

    def split_to_train_test(df, label_column, train_frac=0.8):
        train_df, test_df = pd.DataFrame(), pd.DataFrame()
        labels = df[label_column].unique()
        for lbl in labels:
            lbl_df = df[df[label_column] == lbl]
            lbl_train_df = lbl_df.sample(frac=train_frac)
            lbl_test_df = lbl_df.drop(lbl_train_df.index)
            print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            train_df = pd.concat([train_df, lbl_train_df])
            test_df = pd.concat([test_df, lbl_test_df])

        return train_df, test_df

and use it:

    train, test = split_to_train_test(data, 'class', 0.7)

You can also pass random_state to sample() if you want to control the split randomness, or use some global random seed.
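In recent pandas versions (1.1 and later), the per-label loop can likely be condensed with groupby(...).sample. The following is a sketch under that assumption and under the assumption that the index labels are unique; the function name stratified_split and the toy data are made up for illustration.

```python
import pandas as pd

def stratified_split(df, label_column, train_frac=0.8, seed=0):
    """Sample train_frac of the rows within each label, keep the rest for testing."""
    train = df.groupby(label_column).sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)  # relies on unique index labels
    return train, test

# Toy data with an 80/20 class imbalance, purely illustrative
df = pd.DataFrame({"feature": range(100), "class": ["a"] * 80 + ["b"] * 20})
train, test = stratified_split(df, "class")

print(train["class"].value_counts().to_dict())
print(test["class"].value_counts().to_dict())
```

Both parts keep the original 4:1 ratio between the two classes, which is the property the loop-based version is after.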

User · Answer

If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

    def split_data(df, train_perc=0.8):
        df['train'] = np.random.rand(len(df)) < train_perc
        train = df[df.train == 1]
        test = df[df.train == 0]
        split_data = {'train': train, 'test': test}
        return split_data

User · Answer

Pandas random sample will also work:

    train = df.sample(frac=0.8, random_state=200)  # random state is a seed value
    test = df.drop(train.index)
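One caveat with this pattern: drop(train.index) removes rows by index label, so it assumes the labels are unique. The sketch below, with made-up data, shows the issue on a frame whose labels repeat and one possible remedy (resetting the index first).

```python
import pandas as pd

# A frame whose index labels repeat (0..4 twice), e.g. after concatenating two files
dup = pd.concat([pd.DataFrame({"x": range(5)}), pd.DataFrame({"x": range(5, 10)})])

# Naive split: drop() removes every row sharing a label with a sampled row,
# so the test set may end up with fewer than the expected 2 rows.
naive_train = dup.sample(frac=0.8, random_state=200)
naive_test = dup.drop(naive_train.index)

# Making the labels unique first restores the expected 8/2 split.
df = dup.reset_index(drop=True)
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

print(len(train), len(test), len(naive_test))
```

If the original labels carry meaning, reset_index() without drop=True keeps them as a regular column.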

User · Answer

You can use ~ (the tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle the sampling and filtering of indexes, to obtain two sets.

    train_df = df.sample(frac=0.8, random_state=100)
    test_df = df[~df.index.isin(train_df.index)]

User · Answer

    shuffle = np.random.permutation(len(df))
    test_size = int(len(df) * 0.2)
    test_aux = shuffle[:test_size]
    train_aux = shuffle[test_size:]
    TRAIN_DF = df.iloc[train_aux]
    TEST_DF = df.iloc[test_aux]

User · Answer

You can use the code below to create test and train samples:

    from sklearn.model_selection import train_test_split
    trainingSet, testSet = train_test_split(df, test_size=0.2)

The test size can vary depending on the percentage of data you want to put in your test and train datasets.

User · Answer

There are many valid answers. Adding one more to the bunch:

    # sklearn.cross_validation was removed in scikit-learn 0.20; the function now lives here
    from sklearn.model_selection import train_test_split

    # gets a random 80% of the entire set
    X_train = X.sample(frac=0.8, random_state=1)
    # gets the left-out portion of the dataset
    X_test = X.loc[~X.index.isin(X_train.index)]

User · Answer

Split the DataFrame with train_test_split, wrap the results in pd.DataFrame, and write each part to a file:

    import pandas as pd
    df = pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2)
    train1 = pd.DataFrame(train)
    test1 = pd.DataFrame(test)
    train1.to_csv('/content/drive/My Drive/train.csv', sep="\t", header=None, encoding='utf-8', index=False)
    test1.to_csv('/content/drive/My Drive/test.csv', sep="\t", header=None, encoding='utf-8', index=False)

User · Answer

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

    def make_sets(data_df, test_portion):
        import random as rnd

        tot_ix = range(len(data_df))
        test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
        train_ix = list(set(tot_ix) ^ set(test_ix))

        # positional selection; .ix was removed in pandas 1.0
        test_df = data_df.iloc[test_ix]
        train_df = data_df.iloc[train_ix]

        return train_df, test_df

    train_df, test_df = make_sets(data_df, 0.2)
    test_df.head()

User · Answer

Just select a range of rows from df like this:

    row_count = df.shape[0]
    split_point = int(row_count * 1/5)
    test_data, train_data = df[:split_point], df[split_point:]
