How to split data into trainset and testset randomly

Question

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, where each line contains one example. I need to choose 50 lines as the training set and 50 lines as the testing set.

My idea is to first generate a random list of length 100 (values range from 1 to 100), then use the first 50 elements as the line numbers for the 50 training examples, and the remaining 50 for the testing set.

This could be achieved easily in Matlab:

    fid = fopen(datafile);
    C = textscan(fid, '%s', 'delimiter', '\n');
    plist = randperm(100);
    for i = 1:50
        trainstring = C{plist(i)};
        fprintf(train_file, trainstring);
    end
    for i = 51:100
        teststring = C{plist(i)};
        fprintf(test_file, teststring);
    end

But how could I accomplish this in Python? I'm new to Python and don't know whether I could read the whole file into an array and choose certain lines.

User · Answer

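A quick note on the answer from @subin sahayam:

    import random

    with open("datafile.txt", "r") as file:
        # Pass your preferred delimiter to split(); the default splits on whitespace.
        data = [line.split() for line in file]

    random.shuffle(data)

    train_data = data[:int((len(data) + 1) * .80)]  # first 80% goes to the training set
    test_data = data[int(len(data) * .80 + 1):]     # remaining 20% goes to the test set

Whether the 1 belongs in the line below depends on the size of the list. For some sizes (for example, 10 items, where the training slice ends at index 8 but the test slice starts at index 9) adding it skips one example between the two sets. Check the size of the list first and then decide whether to add the 1:

    test_data = data[int(len(data) * .80 + 1):]

Using the same boundary expression in both slices avoids the check altogether.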
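One way to sidestep the off-by-one question is to compute the boundary once and reuse it for both slices. This is only a minimal sketch, assuming the examples are already in a list. `split_at_fraction` and its parameters are hypothetical names, not taken from any answer here.

```python
import random


def split_at_fraction(data, train_fraction=0.8, seed=None):
    """Shuffle a copy of data and cut it once at train_fraction (hypothetical helper)."""
    rng = random.Random(seed)      # private generator; seed=None means unseeded
    shuffled = list(data)          # copy, so the caller's sequence is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # single boundary shared by both slices
    return shuffled[:cut], shuffled[cut:]


train_data, test_data = split_at_fraction(range(10), train_fraction=0.8, seed=0)
print(len(train_data), len(test_data))
```

Because both slices share `cut`, every example lands in exactly one of the two sets regardless of whether the list length is odd or even.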

User · Answer

You could also use numpy, when your data is stored in a numpy.ndarray:

    import numpy as np
    from random import sample

    l = 100  # length of data
    f = 50   # number of elements you need
    indices = sample(range(l), f)

    train_data = data[indices]
    # axis=0 removes whole rows; without it a 2-D array would be flattened first
    test_data = np.delete(data, indices, axis=0)
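The same index-sampling idea also works on a plain Python list. The sketch below uses only the standard library and assumes the examples fit in memory. `split_by_indices` is a hypothetical name chosen for illustration.

```python
import random


def split_by_indices(data, n_train, seed=None):
    """Pick n_train random positions for training; the rest become the test set."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(data)), n_train))  # unique random positions
    train = [item for i, item in enumerate(data) if i in chosen]
    test = [item for i, item in enumerate(data) if i not in chosen]
    return train, test


examples = ["example %d" % i for i in range(100)]
train_data, test_data = split_by_indices(examples, n_train=50, seed=42)
```

Unlike indexing an array with a shuffled index list, this version keeps the original file order inside each set. Shuffle the results afterwards if the order matters.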

User · Answer

Well, first of all, Python's built-in sequence type is the list rather than the array, and that does make a difference. I suggest you use NumPy, which is a pretty good library for Python and adds a lot of Matlab-like functionality. You can get started here: Numpy for Matlab users.

User · Answer

To answer @desmond carros's question, I modified the best answer as follows:

    import random

    with open("datafile.txt", "r") as file:
        # Pass your preferred delimiter to split(); the default splits on whitespace.
        data = [line.split() for line in file]

    random.shuffle(data)

    train_data = data[:int((len(data) + 1) * .80)]  # first 80% goes to the training set
    test_data = data[int((len(data) + 1) * .80):]   # remaining 20% goes to the test set

The code splits the entire dataset into 80% training and 20% test data.

User · Answer

    from sklearn.model_selection import train_test_split
    import numpy

    # Open in text mode so the lines are str rather than bytes
    with open("datafile.txt", "r") as f:
        data = f.read().splitlines()
        data = numpy.array(data)  # convert the list to a numpy array

    # test_size=0.5 puts half of the whole data in the test set
    x_train, x_test = train_test_split(data, test_size=0.5)

User · Answer

sklearn.cross_validation has been deprecated since version 0.18; you should use sklearn.model_selection instead, as shown below:

    from sklearn.model_selection import train_test_split
    import numpy

    # Open in text mode so the lines are str rather than bytes
    with open("datafile.txt", "r") as f:
        data = f.read().splitlines()
        data = numpy.array(data)  # convert the list to a numpy array

    # test_size=0.5 puts half of the whole data in the test set
    x_train, x_test = train_test_split(data, test_size=0.5)

User · Answer

The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

    import math
    import random

    def k_fold(myfile, myseed=11109, k=3):
        # Load data, one example per line
        with open(myfile) as f:
            data = f.readlines()

        # Shuffle input reproducibly
        random.seed(myseed)
        random.shuffle(data)

        # Compute partition size given input k
        len_part = int(math.ceil(len(data) / float(k)))

        # Create one partition per fold
        train = {}
        test = {}
        for ii in range(k):
            start = ii * len_part
            test[ii] = data[start:start + len_part]
            # Slice around the test block so duplicate lines are not dropped
            train[ii] = data[:start] + data[start + len_part:]

        return train, test
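Chunking by a rounded-up partition size can leave the last fold much smaller than the others. A possible variant, sketched here on an in-memory sequence rather than a file, assigns every k-th shuffled item to a fold so that fold sizes differ by at most one. `k_fold_strided` is a hypothetical name, and the return shape simply mirrors the function above.

```python
import random


def k_fold_strided(items, k=3, seed=11109):
    """Return (train, test) dicts keyed by fold number, with near-equal fold sizes."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train, test = {}, {}
    for fold in range(k):
        # Every k-th item, starting at offset `fold`, is held out for this fold.
        test[fold] = shuffled[fold::k]
        train[fold] = [x for i, x in enumerate(shuffled) if i % k != fold]
    return train, test


train, test = k_fold_strided(range(10), k=4)
print([len(test[fold]) for fold in range(4)])
```

With 10 items and k=4, chunks of size ceil(10/4)=3 give test folds of 3, 3, 3 and 1 items, while striding gives 3, 3, 2 and 2.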

User · Answer

This can be done similarly in Python using lists (note that the whole list is shuffled in place):

    import random

    # Open in text mode so the lines are str rather than bytes
    with open("datafile.txt", "r") as f:
        data = f.read().splitlines()

    random.shuffle(data)

    train_data = data[:50]
    test_data = data[50:]
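The Matlab code in the question also writes the two halves back to separate files. A rough Python equivalent might look like the sketch below. The function name `split_file` and all of its parameters are assumptions made for illustration.

```python
import random


def split_file(src_path, train_path, test_path, n_train, seed=None):
    """Shuffle the lines of src_path; write n_train of them to train_path, the rest to test_path."""
    with open(src_path, "r") as src:
        lines = src.read().splitlines()
    random.Random(seed).shuffle(lines)
    train_lines, test_lines = lines[:n_train], lines[n_train:]
    with open(train_path, "w") as train_file:
        train_file.writelines(line + "\n" for line in train_lines)
    with open(test_path, "w") as test_file:
        test_file.writelines(line + "\n" for line in test_lines)
    return len(train_lines), len(test_lines)
```

For the 100-line file in the question, a call such as `split_file("datafile.txt", "train.txt", "test.txt", n_train=50)` would write 50 lines to each output file. The output file names here are placeholders.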

User · Answer

You can try this approach:

    import pandas
    import sklearn.cross_validation  # only exists in old scikit-learn releases

    csv = pandas.read_csv('data.csv')
    train, test = sklearn.cross_validation.train_test_split(csv, train_size=0.5)

UPDATE: train_test_split was moved to model_selection, so the current way (scikit-learn 0.22.2) to do it is this:

    import pandas
    import sklearn.model_selection  # the submodule must be imported explicitly

    csv = pandas.read_csv('data.csv')
    train, test = sklearn.model_selection.train_test_split(csv, train_size=0.5)
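If pandas and scikit-learn are not available, a similar row-level split can be sketched with the standard `csv` module. The sketch assumes the file has exactly one header row, which is kept out of the shuffle. `split_csv_rows` is a hypothetical helper name.

```python
import csv
import random


def split_csv_rows(path, train_fraction=0.5, seed=None):
    """Return (header, train_rows, test_rows) for a CSV file with one header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)      # assumes the first row holds the column names
        rows = list(reader)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)  # single boundary shared by both slices
    return header, rows[:cut], rows[cut:]
```

Keeping the header separate matters for the purely line-based approaches too, since shuffling raw lines would otherwise mix the header into the data.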
