I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and 50 lines as the testing set.
My idea is to first generate a random permutation of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 as the line numbers of the testing examples.
This can be achieved easily in Matlab:
fid = fopen(datafile);
C = textscan(fid, '%s', 'delimiter', '\n');  % C{1} is a cell array of lines
fclose(fid);
train_file = fopen('train.txt', 'w');  % output file names are placeholders
test_file = fopen('test.txt', 'w');
plist = randperm(100);                 % random permutation of 1..100
for i = 1:50
    trainstring = C{1}{plist(i)};      % index into the cell array of lines
    fprintf(train_file, '%s\n', trainstring);
end
for i = 51:100
    teststring = C{1}{plist(i)};
    fprintf(test_file, '%s\n', teststring);
end
fclose(train_file);
fclose(test_file);
But how can I accomplish this in Python? I'm new to Python, and don't know whether I can read the whole file into a list and then choose certain lines.
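Edit: for reference, here is a direct Python translation of the Matlab logic above. This is a minimal sketch, assuming the input is datafile.txt; the output names train.txt and test.txt are placeholders:

import random

with open("datafile.txt") as f:
    lines = f.readlines()            # read the whole file into a list of lines

plist = list(range(len(lines)))      # 0-based equivalent of randperm
random.shuffle(plist)

half = len(lines) // 2
with open("train.txt", "w") as train_file:
    train_file.writelines(lines[i] for i in plist[:half])
with open("test.txt", "w") as test_file:
    test_file.writelines(lines[i] for i in plist[half:])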
You could also use numpy. When your data is stored in a numpy.ndarray:
import numpy as np
from random import sample

l = 100                               # length of the data
f = 50                                # number of elements you need
indices = sample(range(l), f)         # f distinct random indices
train_data = data[indices]            # fancy indexing picks the sampled elements
test_data = np.delete(data, indices)  # everything else (flattens 2-D data unless axis is given)
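For instance, assuming data is a one-dimensional array of lines (for 2-D data you would pass axis=0 to np.delete so that rows are removed instead of the array being flattened):

import numpy as np
from random import sample

data = np.array(["line %d" % i for i in range(100)])  # stand-in for your 100 lines
indices = sample(range(len(data)), 50)
train_data = data[indices]
test_data = np.delete(data, indices)
print(len(train_data), len(test_data))                # prints: 50 50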
To answer @desmond.carros' question, I modified the best answer as follows:
import random

file = open("datafile.txt", "r")
data = list()
for line in file:
    data.append(line.split())    # split on your preferred delimiter
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int((len(data)+1)*.80):]   # remaining 20% goes to the test set
The code splits the entire dataset into 80% train and 20% test data.
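For the 50-50 split asked about in the question, the same shuffle-then-slice pattern works with a midpoint index (a sketch, assuming data is the list of lines read above):

import random

random.shuffle(data)       # data is the list of lines read above
half = len(data) // 2      # midpoint for a 50-50 split
train_data = data[:half]
test_data = data[half:]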
Well, first of all, there's no such thing as "arrays" in Python: Python uses lists, and that does make a difference. I suggest you use NumPy, which is a pretty good library for Python that adds a lot of Matlab-like functionality. You can get started here: Numpy for Matlab users.
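As one example of that Matlab-like functionality, np.random.permutation plays the role of randperm, so the 50-50 split from the question takes a couple of lines (a sketch, assuming data is a NumPy array of 100 lines):

import numpy as np

plist = np.random.permutation(len(data))  # NumPy's equivalent of randperm
train_data = data[plist[:50]]             # first 50 shuffled indices
test_data = data[plist[50:]]              # remaining 50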
A quick note for the answer from @subin sahayam
import random

file = open("datafile.txt", "r")
data = list()
for line in file:
    data.append(line.split())    # split on your preferred delimiter
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first ~80% to the training set
test_data = data[int(len(data)*.80+1):]     # rest to the test set (note the +1 discussed below)
If your list size is an even number, you should not add the 1 in the line below; otherwise one element is skipped. For example, with 100 lines, int((100+1)*.80) is 80 while int(100*.80+1) is 81, so data[80] ends up in neither set. Instead, check the size of the list first and then determine whether you need to add the 1.

test_data = data[int(len(data)*.80+1):]
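A simpler way to sidestep the off-by-one entirely is to compute the split index once and use it for both slices, a sketch:

split = int(len(data) * .80)   # compute the split point once
train_data = data[:split]      # first 80%
test_data = data[split:]       # remaining 20%; every element lands in exactly one set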
The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.
import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data
    data = open(myfile).readlines()
    # Shuffle input reproducibly (random.seed must be called, not assigned to)
    random.seed(myseed)
    random.shuffle(data)
    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))
    # Create one partition per fold: fold ii tests on slice ii and trains on
    # everything outside that slice (slicing instead of a membership test,
    # which would misbehave on duplicate lines)
    train = {}
    test = {}
    for ii in range(k):
        test[ii] = data[ii*len_part:ii*len_part + len_part]
        train[ii] = data[:ii*len_part] + data[ii*len_part + len_part:]
    return train, test
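For the question's 50-50 split, call it with k=2 and keep one fold, for example:

train, test = k_fold("datafile.txt", k=2)
train_data = train[0]   # second half of the shuffled lines
test_data = test[0]     # first half of the shuffled lines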
from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()   # read the file as text and split it into lines
data = numpy.array(data)           # convert the list to a numpy array
x_train, x_test = train_test_split(data, test_size=0.5)  # 50% of the data goes to the test set
You can try this approach:

import pandas
import sklearn.cross_validation

csv = pandas.read_csv('data.csv')
train, test = sklearn.cross_validation.train_test_split(csv, train_size=0.5)
UPDATE: train_test_split was moved to model_selection, so the current way (scikit-learn 0.22.2) to do it is this:
import pandas
import sklearn.model_selection

csv = pandas.read_csv('data.csv')
train, test = sklearn.model_selection.train_test_split(csv, train_size=0.5)
sklearn.cross_validation is deprecated since version 0.18; instead you should use sklearn.model_selection, as shown below:
from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()   # read the file as text and split it into lines
data = numpy.array(data)           # convert the list to a numpy array
x_train, x_test = train_test_split(data, test_size=0.5)  # 50% of the data goes to the test set