How to split data into 3 sets (train, validation and test)?

Question

I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I couldn't find any solution for splitting the data into three sets. Preferably, I'd like to have the indices of the original data.

I know that a workaround would be to use train_test_split two times and somehow adjust the indices. But is there a more standard / built-in way to split the data into 3 sets instead of 2?

User · Answer

In the case of supervised learning, you may want to split both X and y (where X is your input and y the ground truth output). You just have to make sure X and y are shuffled the same way before splitting.

Here, either X and y are in the same dataframe (so we shuffle them, separate them and apply the split to each, just like in the chosen answer), or X and y are in two different dataframes (so we shuffle X, reorder y the same way as the shuffled X and apply the split to each).

    import numpy as np

    # 1st case: df contains X and y (where y is the "target" column of df)
    df_shuffled = df.sample(frac=1)
    X_shuffled = df_shuffled.drop("target", axis=1)
    y_shuffled = df_shuffled["target"]

    # 2nd case: X and y are two separate dataframes
    X_shuffled = X.sample(frac=1)
    # .loc reorders y by index label; plain y[...] would select columns of a dataframe
    y_shuffled = y.loc[X_shuffled.index]

    # We do the split as in the chosen answer
    X_train, X_validation, X_test = np.split(X_shuffled, [int(0.6*len(X)), int(0.8*len(X))])
    y_train, y_validation, y_test = np.split(y_shuffled, [int(0.6*len(X)), int(0.8*len(X))])
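A small check that X and y really stay paired after the shuffle can be sketched like this. The toy data, the column names f1/f2 and the string index are made up for illustration, and the pieces are cut with .iloc instead of np.split; the cut points are the same.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 rows, features in X and labels in y, sharing one index.
X = pd.DataFrame({"f1": np.arange(10), "f2": np.arange(10) * 2}, index=list("abcdefghij"))
y = pd.Series(np.arange(10) % 2, index=X.index, name="target")

# Shuffle X, then reorder y by the shuffled index labels so rows stay paired.
X_shuffled = X.sample(frac=1, random_state=0)
y_shuffled = y.loc[X_shuffled.index]

# Cut points for a 60/20/20 split.
n = len(X_shuffled)
first_cut, second_cut = int(0.6 * n), int(0.8 * n)

X_train = X_shuffled.iloc[:first_cut]
X_validation = X_shuffled.iloc[first_cut:second_cut]
X_test = X_shuffled.iloc[second_cut:]

y_train = y_shuffled.iloc[:first_cut]
y_validation = y_shuffled.iloc[first_cut:second_cut]
y_test = y_shuffled.iloc[second_cut:]

# Each X part and its y part should carry exactly the same index labels.
print(X_train.index.equals(y_train.index))
```

If the index labels of an X part and its y part ever differ, the shuffle and the reorder went out of sync.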

User · Answer

Numpy solution. We will shuffle the whole dataset first (df.sample(frac=1, random_state=42)) and then split our data set into the following parts:

60% - train set,
20% - validation set,
20% - test set

    In [305]: train, validate, test = \
                  np.split(df.sample(frac=1, random_state=42),
                           [int(.6*len(df)), int(.8*len(df))])

    In [306]: train
    Out[306]:
              A         B         C         D         E
    0  0.046919  0.792216  0.206294  0.440346  0.038960
    2  0.301010  0.625697  0.604724  0.936968  0.870064
    1  0.642237  0.690403  0.813658  0.525379  0.396053
    9  0.488484  0.389640  0.599637  0.122919  0.106505
    8  0.842717  0.793315  0.554084  0.100361  0.367465
    7  0.185214  0.603661  0.217677  0.281780  0.938540

    In [307]: validate
    Out[307]:
              A         B         C         D         E
    5  0.806176  0.008896  0.362878  0.058903  0.026328
    6  0.145777  0.485765  0.589272  0.806329  0.703479

    In [308]: test
    Out[308]:
              A         B         C         D         E
    4  0.521640  0.332210  0.370177  0.859169  0.401087
    3  0.333348  0.964011  0.083498  0.670386  0.169619

[int(.6*len(df)), int(.8*len(df))] is the indices_or_sections array for numpy.split().

Here is a small demo of np.split() usage. Let's split a 20-element array into the following parts: 80%, 10%, 10%:

    In [45]: a = np.arange(1, 21)

    In [46]: a
    Out[46]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

    In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
    Out[47]:
    [array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]),
     array([17, 18]),
     array([19, 20])]
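The cut indices passed to np.split can also be derived from the fractions themselves. The helper below is only a sketch: cut_points is a made-up name, not part of numpy, and it rounds where the answer truncates with int(), to guard against floating-point products landing just under a whole number.

```python
import numpy as np

def cut_points(n_rows, fractions):
    """Turn fractions such as (0.6, 0.2, 0.2) into the cut indices np.split expects."""
    # Cumulative fractions give the right edge of every part; the last edge
    # (1.0) is dropped because np.split puts the remainder in the final part.
    cumulative = np.cumsum(fractions)[:-1]
    return [int(round(float(c) * n_rows)) for c in cumulative]

a = np.arange(1, 21)
cuts = cut_points(len(a), (0.8, 0.1, 0.1))
parts = np.split(a, cuts)
print(cuts, [len(p) for p in parts])
```

The same list of cuts works with positional slicing of a dataframe (df.iloc[:cuts[0]] and so on). That may be worth considering because, as far as I know, some recent pandas releases emit a deprecation warning when a DataFrame is passed straight to np.split.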

User · Answer

Note: The function was written to handle seeding of the randomized set creation. You should not rely on set splitting that doesn't randomize the sets.

    import numpy as np
    import pandas as pd

    def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
        np.random.seed(seed)
        m = len(df.index)
        # Permute row positions rather than index labels, so that .iloc also
        # works when the index is not the default 0..m-1 range.
        perm = np.random.permutation(m)
        train_end = int(train_percent * m)
        validate_end = int(validate_percent * m) + train_end
        train = df.iloc[perm[:train_end]]
        validate = df.iloc[perm[train_end:validate_end]]
        test = df.iloc[perm[validate_end:]]
        return train, validate, test

Demonstration

    np.random.seed([3,1415])
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
    df

    train, validate, test = train_validate_test_split(df)

    train

    validate

    test
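The same idea can be written without touching numpy's global random state, by using a local generator from np.random.default_rng. This is a variant sketch, not the answer's own code: the function name split_with_generator and the toy dataframe with a non-default index are made up for illustration.

```python
import numpy as np
import pandas as pd

def split_with_generator(df, train_frac=0.6, validate_frac=0.2, seed=None):
    """Shuffle row positions with a local generator, then slice into three parts."""
    rng = np.random.default_rng(seed)        # local RNG, global state stays untouched
    positions = rng.permutation(len(df))     # shuffled positions 0..len(df)-1
    train_end = int(train_frac * len(df))
    validate_end = train_end + int(validate_frac * len(df))
    train = df.iloc[positions[:train_end]]
    validate = df.iloc[positions[train_end:validate_end]]
    test = df.iloc[positions[validate_end:]]
    return train, validate, test

# Hypothetical dataframe whose index does not start at 0.
df = pd.DataFrame(np.arange(50).reshape(10, 5), columns=list("ABCDE"), index=range(100, 110))

train, validate, test = split_with_generator(df, seed=42)
train_again, _, _ = split_with_generator(df, seed=42)

print(len(train), len(validate), len(test))
print(train.index.equals(train_again.index))
```

Passing the same seed twice gives the same split, and the original index labels survive in each part.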

User · Answer

Considering that df is your original dataframe:

1 - First you split the data between Train and Test (10%):

    from sklearn.model_selection import train_test_split

    my_test_size = 0.10

    X_train_, X_test, y_train_, y_test = train_test_split(
        df.index.values,
        df.label.values,
        test_size=my_test_size,
        random_state=42,
        stratify=df.label.values,
    )

2 - Then you split the train set between train and validation (20%):

    my_val_size = 0.20

    X_train, X_val, y_train, y_val = train_test_split(
        df.loc[X_train_].index.values,
        df.loc[X_train_].label.values,
        test_size=my_val_size,
        random_state=42,
        stratify=df.loc[X_train_].label.values,
    )

3 - Then, you slice the original dataframe according to the indices generated in the steps above:

    # data_type is not necessary.
    df['data_type'] = ['not_set'] * df.shape[0]
    df.loc[X_train, 'data_type'] = 'train'
    df.loc[X_val, 'data_type'] = 'val'
    df.loc[X_test, 'data_type'] = 'test'

The result is the original dataframe with a data_type column marking each row as train, val or test.

Note: This solution uses the workaround mentioned in the question.
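For anyone who wants to try the three steps end to end, here is a compact sketch on a toy dataframe. The data (100 rows, a 70/30 label column) and the names rest_idx, train_idx, val_idx and test_idx are invented for the example; only the index values are split, since that is what the question asks for.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe: 100 rows with an imbalanced "label" column.
df = pd.DataFrame({"value": range(100), "label": ["a"] * 70 + ["b"] * 30})

# Stage 1: hold out 10% of the index values as the test set, stratified by label.
rest_idx, test_idx = train_test_split(
    df.index.values, test_size=0.10, random_state=42, stratify=df["label"].values
)

# Stage 2: split the remaining rows into train and validation (20% of the rest).
train_idx, val_idx = train_test_split(
    rest_idx, test_size=0.20, random_state=42, stratify=df.loc[rest_idx, "label"].values
)

# Tag each row of the original dataframe with the split it belongs to.
df["data_type"] = "not_set"
df.loc[train_idx, "data_type"] = "train"
df.loc[val_idx, "data_type"] = "val"
df.loc[test_idx, "data_type"] = "test"

print(df["data_type"].value_counts())
```

Note that the 20% here is taken from the 90% left after the first step, so validation ends up at roughly 18% of the whole dataframe, not 20%.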

User · Answer

    from sklearn.model_selection import train_test_split

    def train_val_test_split(X, y, train_size, val_size, test_size):
        # First split: set the test portion aside.
        X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=test_size)
        # Train fraction relative to what is left (train + val).
        relative_train_size = train_size / (val_size + train_size)
        X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                          train_size=relative_train_size,
                                                          test_size=1 - relative_train_size)
        return X_train, X_val, X_test, y_train, y_val, y_test

Here we split the data 2 times with sklearn's train_test_split.

User · Answer

Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.

    import math

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                             frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                             random_state=None):
        '''
        Splits a Pandas dataframe into three subsets (train, val, and test)
        following fractional ratios provided by the user, where each subset is
        stratified by the values in a specific column (that is, each subset has
        the same relative frequency of the values in the column). It performs this
        splitting by running train_test_split() twice.

        Parameters
        ----------
        df_input : Pandas dataframe
            Input dataframe to be split.
        stratify_colname : str
            The name of the column that will be used for stratification. Usually
            this column would be for the label.
        frac_train : float
        frac_val   : float
        frac_test  : float
            The ratios with which the dataframe will be split into train, val, and
            test data. The values should be expressed as float fractions and should
            sum to 1.0.
        random_state : int, None, or RandomStateInstance
            Value to be passed to train_test_split().

        Returns
        -------
        df_train, df_val, df_test :
            Dataframes containing the three splits.
        '''

        # Compare with a tolerance: sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0.
        if not math.isclose(frac_train + frac_val + frac_test, 1.0):
            raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                             (frac_train, frac_val, frac_test))

        if stratify_colname not in df_input.columns:
            raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

        X = df_input  # Contains all columns.
        y = df_input[[stratify_colname]]  # Dataframe of just the column on which to stratify.

        # Split original dataframe into train and temp dataframes.
        df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                              y,
                                                              stratify=y,
                                                              test_size=(1.0 - frac_train),
                                                              random_state=random_state)

        # Split the temp dataframe into val and test dataframes.
        relative_frac_test = frac_test / (frac_val + frac_test)
        df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                          y_temp,
                                                          stratify=y_temp,
                                                          test_size=relative_frac_test,
                                                          random_state=random_state)

        assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

        return df_train, df_val, df_test

Below is a complete working example.

Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.

Here is the example dataset:

    df = pd.DataFrame( { 'A': list(range(0, 100)),
                         'B': list(range(100, 0, -1)),
                         'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )

    df.head()
    #    A    B label
    # 0  0  100   foo
    # 1  1   99   foo
    # 2  2   98   foo
    # 3  3   97   foo
    # 4  4   96   foo

    df.shape
    # (100, 3)

    df.label.value_counts()
    # foo    75
    # bar    15
    # baz    10
    # Name: label, dtype: int64

Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

    df_train, df_val, df_test = \
        split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)

The three dataframes df_train, df_val, and df_test contain all the original rows, but their sizes will follow the above ratio.

    df_train.shape
    # (60, 3)

    df_val.shape
    # (20, 3)

    df_test.shape
    # (20, 3)

Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.

    df_train.label.value_counts()
    # foo    45
    # bar     9
    # baz     6
    # Name: label, dtype: int64

    df_val.label.value_counts()
    # foo    15
    # bar     3
    # baz     2
    # Name: label, dtype: int64

    df_test.label.value_counts()
    # foo    15
    # bar     3
    # baz     2
    # Name: label, dtype: int64

User · Answer

One approach to dividing the dataset into train, test and cv with 0.6, 0.2, 0.2 would be to use the train_test_split method twice.

    from sklearn.model_selection import train_test_split

    # xtrain and labels are the full features and labels, defined elsewhere.
    x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
    x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)
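The 0.25 in the second call is what makes the numbers work out: 25% of the remaining 80% is 0.25 * 0.8 = 0.2 of the whole dataset. A quick sketch to confirm the sizes, with made-up arrays named features and targets standing in for xtrain and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the answer's xtrain and labels: 100 samples.
features = np.arange(200).reshape(100, 2)
targets = np.arange(100) % 2

# First call: 80% kept for train + cv, 20% held out for test.
x_rest, x_test, y_rest, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=0
)

# Second call: 25% of the remaining 80 rows goes to cv, the other 75% to train.
x_train, x_cv, y_train, y_cv = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

print(len(x_train), len(x_cv), len(x_test))
```

With 100 rows this should give 60 for train and 20 each for cv and test.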

User · Answer

It is very convenient to use train_test_split, since it needs no reindexing after dividing into several sets and no additional code. The best answer above does not mention that calling train_test_split twice without adjusting the partition sizes won't give the initially intended partition:

    x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The validation and test portions then have to be expressed relative to x_remain, and can be computed as:

    new_test_size = np.around(test_size / (val_size + test_size), 2)
    # To preserve (new_test_size + new_val_size) = 1.0
    new_val_size = 1.0 - new_test_size

    x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

This way all the initial partition sizes are preserved.
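The rescaling is plain arithmetic and can be checked without sklearn. In the sketch below, second_stage_fractions is a made-up helper name. It skips the np.around(..., 2) step, because rounding the ratio to two decimals can shift the final sizes slightly.

```python
def second_stage_fractions(val_size, test_size):
    """Return (val, test) fractions relative to the rows left after removing train."""
    remain = val_size + test_size
    if not 0.0 < remain < 1.0:
        raise ValueError("val_size + test_size must be strictly between 0 and 1")
    new_test_size = test_size / remain
    new_val_size = 1.0 - new_test_size   # the two relative fractions sum to 1.0
    return new_val_size, new_test_size

# Example: 60% train, 15% validation, 25% test.
val_size, test_size = 0.15, 0.25
new_val_size, new_test_size = second_stage_fractions(val_size, test_size)

# Multiplying back by the size of the remainder recovers the intended fractions.
overall_val = (val_size + test_size) * new_val_size
overall_test = (val_size + test_size) * new_test_size
print(new_val_size, new_test_size, overall_val, overall_test)
```

Here the remainder is 40% of the data, so the second split has to use 0.625 for test and 0.375 for validation to end up at 25% and 15% of the original.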
