How to split/partition a dataset into training and test datasets for, e.g., cross validation?

Question

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.

User · Answer

After doing some reading and taking into account the (many) different ways of splitting the data into train and test sets, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well-designed and tested code):

1. shuffle the whole matrix arr and then split the data into train and test
2. shuffle the indices and then apply them to x and y to split the data
3. same as method 2, but done in a more efficient way
4. use a pandas DataFrame to split

Method 3 won by far with the shortest time, followed by method 1; methods 2 and 4 turned out to be really inefficient.

The code for the 4 different methods I timed:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: shuffle the indices and then apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False so no row is drawn twice
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

And for the times, the minimum time to execute out of 3 repetitions of 1000 loops is:

Method 1: 0.35883826200006297 seconds
Method 2: 1.7157016959999964 seconds
Method 3: 1.7876616719995582 seconds
Method 4: 0.07562861499991413 seconds
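
The timing harness itself is not shown above. A minimal sketch of how such a measurement might be set up with the standard timeit module, assuming 3 repetitions of 1000 loops as described. The function names split_shuffle_rows and split_permuted_indices are made up for illustration, only methods 1 and 3 are covered, and absolute times will differ by machine:

```python
import timeit

import numpy as np

arr = np.random.rand(100, 3)
sample = int(0.7 * len(arr))


def split_shuffle_rows(a):
    """Method 1 style: shuffle a copy of the rows, then slice."""
    a = a.copy()  # copy so repeated timing runs do not mutate the input
    np.random.shuffle(a)
    return a[:sample, :2], a[sample:, :2], a[:sample, 2], a[sample:, 2]


def split_permuted_indices(a):
    """Method 3 style: permute the row indices, then index X and Y."""
    idx = np.random.permutation(a.shape[0])
    train_idx, test_idx = idx[:sample], idx[sample:]
    X, Y = a[:, :2], a[:, 2]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]


for func in (split_shuffle_rows, split_permuted_indices):
    # repeat=3, number=1000: keep the fastest of the three runs
    best = min(timeit.repeat(lambda: func(arr), repeat=3, number=1000))
    print(f"{func.__name__}: {best:.4f} seconds")
```

Taking the minimum over the repetitions is the usual convention with timeit, since slower runs mostly reflect interference from other processes.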

I hope that's helpful!

User · Answer

As the sklearn.cross_validation module was deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

User · Answer

If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
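
For repeated partitioning in which the test sets never overlap, one option is k-fold: shuffle the indices once, cut them into k folds, and use each fold as the test set in turn. A minimal NumPy-only sketch; the helper name kfold_indices and the seed are illustrative assumptions, and sklearn's KFold offers the same idea ready-made:

```python
import numpy as np


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled, disjoint folds (assumes k >= 2)."""
    rng = np.random.default_rng(seed)
    # array_split tolerates n_samples not being divisible by k
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


# x is your dataset
x = np.random.rand(100, 5)
for train_idx, test_idx in kfold_indices(x.shape[0], k=5):
    training, test = x[train_idx, :], x[test_idx, :]
    print(training.shape, test.shape)
```

Every row lands in exactly one test fold, which is what distinguishes this from resampling with repetition.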

User · Answer

Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# or, equivalently:
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]

test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)
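
A quick way to confirm that a split has neither repeated rows nor overlap is to check the index arrays directly. A small sketch, assuming NumPy's Generator API (np.random.default_rng) with an arbitrary seed in place of the legacy np.random functions, and a made-up 100-row X:

```python
import numpy as np

X = np.random.rand(100, 5)  # made-up dataset
n_train = int(np.round(X.shape[0] * 0.8))

rng = np.random.default_rng(42)  # seed chosen arbitrarily, for reproducibility
training_idx = rng.choice(X.shape[0], size=n_train, replace=False)
test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)

# No index is drawn twice, and the two sets do not overlap
assert len(np.unique(training_idx)) == len(training_idx)
assert len(np.intersect1d(training_idx, test_idx)) == 0
assert len(training_idx) + len(test_idx) == X.shape[0]
```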

User · Answer

You may also consider a stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making a random stratified split into training and testing sets
    with proportions train_proportion and (1-train_proportion) of the initial sample.
    y is any iterable indicating the class of each observation in the sample.
    Initial proportions of classes inside the training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))

        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True

    return train_inds, test_inds

y = np.array([1, 1, 2, 2, 3, 3])
train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
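
scikit-learn's train_test_split can produce the same kind of split through its stratify parameter. A minimal sketch on made-up labels; the 80/20 class balance, the 70% train size and the seed are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 samples of class 0 and 20 of class 1
X = np.random.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

# Class counts in each part; the 80/20 ratio should carry over to both
print(np.bincount(y_train), np.bincount(y_test))
```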

User · Answer

Just a note: in case you want train, test, AND validation sets, you can do this:

from sklearn.model_selection import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to training, and 15% each to the test and val sets. Hope this helps.

User · Answer

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
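
A short usage sketch for the shuffle-then-partition idea, with the function condensed to a list comprehension (same striding logic). The sample data and the seed are made up:

```python
import random


def partition(seq, chunks):
    """Split seq into `chunks` roughly equal lists by striding through it."""
    return [list(seq[i::chunks]) for i in range(chunks)]


data = list(range(10))
random.seed(0)        # fixed seed only so the example is repeatable
random.shuffle(data)  # shuffle in place before partitioning

train, test = partition(data, 2)
print(train, test)
```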

User · Answer

Likely you will not only need to split into train and test, but also cross validation to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out np.split:

"If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in ary[:2], ary[2:3], ary[3:]."

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
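
The same np.split call works on a plain NumPy array. A minimal sketch, assuming a made-up 100-row array, an arbitrary seed, and the 70/20/10 proportions mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
data = np.arange(200).reshape(100, 2)  # made-up dataset: 100 rows, 2 columns

shuffled = rng.permutation(data)  # shuffles rows (axis 0) and returns a copy
n = len(shuffled)

# Cut points at 70% and 90% of the rows give 70% / 20% / 10% parts
train, valid, holdout = np.split(shuffled, [int(0.7 * n), int(0.9 * n)])
print(train.shape, valid.shape, holdout.shape)
```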

User · Answer

Here is code to split the data into n=5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

User · Answer

Split into train, test and validation sets:

import numpy as np

x = np.expand_dims(np.arange(100), -1)
print(x)

indices = np.random.permutation(x.shape[0])

# 90% training, 5% test, 5% validation
training_idx, test_idx, val_idx = indices[:int(x.shape[0]*.9)], indices[int(x.shape[0]*.9):int(x.shape[0]*.95)], indices[int(x.shape[0]*.95):]

training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]

print(training, test, val)

User · Answer

I'm aware that my solution is not the best, but it comes in handy when you want to split data in a simplistic way, especially when teaching data science to newbies!

def simple_split(descriptors, targets):
    testX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 0]
    validX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 1]
    trainX_indices = [i for i in range(descriptors.shape[0]) if i % 4 >= 2]

    TrainX = descriptors[trainX_indices, :]
    ValidX = descriptors[validX_indices, :]
    TestX = descriptors[testX_indices, :]

    TrainY = targets[trainX_indices]
    ValidY = targets[validX_indices]
    TestY = targets[testX_indices]

    return TrainX, ValidX, TestX, TrainY, ValidY, TestY

According to this code, data will be split into three parts: 1/4 for the test part, another 1/4 for the validation part, and 2/4 for the training set.
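
The same modulo-based assignment can also be written with boolean masks instead of list comprehensions. A sketch under the same 1/4, 1/4, 2/4 scheme; the function name simple_split_masks and the sample arrays are hypothetical:

```python
import numpy as np


def simple_split_masks(descriptors, targets):
    """Deterministic split: rows with i % 4 == 0 go to test, == 1 to validation, the rest to train."""
    remainder = np.arange(descriptors.shape[0]) % 4
    test, valid, train = remainder == 0, remainder == 1, remainder >= 2
    return (descriptors[train], descriptors[valid], descriptors[test],
            targets[train], targets[valid], targets[test])


descriptors = np.arange(16).reshape(8, 2)  # made-up data: 8 rows, 2 features
targets = np.arange(8)
TrainX, ValidX, TestX, TrainY, ValidY, TestY = simple_split_masks(descriptors, targets)
print(TrainY, ValidY, TestY)  # [2 3 6 7] [1 5] [0 4]
```

Because the assignment depends only on row position, data that is sorted (for example by class) should be shuffled first.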

User · Answer

There is another option that just entails using scikit-learn. As scikit's wiki describes, you can just use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you're trying to split into training and test.
