[python] How to split data into trainset and testset randomly?

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and 50 lines as the testing set.

My idea is to first generate a random list of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 elements for the testing set.

This could be achieved easily in Matlab:

C = textscan(fid, '%s','delimiter', '\n');
plist = randperm(100);  % random permutation of 1..100
for i=1:50
    trainstring = C{plist(i)};
end
for i=51:100
    teststring = C{plist(i)};
end

But how could I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into an array and then pick certain lines.

This can be done similarly in Python using lists (note that the whole list is shuffled in place):

import random

with open("datafile.txt") as f:
    data = f.read().splitlines()  # one example per line

random.shuffle(data)  # shuffles the whole list in place

train_data = data[:50]
test_data = data[50:]

First of all, Python's built-in sequence type is the list rather than the array, and that does make a difference. I suggest you use NumPy, a pretty good library for Python that adds a lot of Matlab-like functionality. You can get started with "NumPy for Matlab users".
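
As a minimal sketch of what that could look like with NumPy: a random permutation of the indices plays the role of the random list described in the question. The `lines` array and the seed below are made up for illustration; in practice the array would hold the lines read from the input file.

```python
import numpy as np

# Hypothetical stand-in for the 100 lines read from the input file.
lines = np.array(["example %d" % i for i in range(100)])

rng = np.random.default_rng(0)       # seed chosen arbitrarily, for reproducibility
perm = rng.permutation(len(lines))   # shuffled indices 0..99

half = len(lines) // 2
train_data = lines[perm[:half]]      # rows picked by the first 50 shuffled indices
test_data = lines[perm[half:]]       # rows picked by the remaining 50 indices
```

Indexing with a permutation leaves `lines` itself in its original order, which can be handy if the line numbers matter later.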

You can try this approach:

import pandas
from sklearn.cross_validation import train_test_split  # only exists in scikit-learn < 0.20

csv = pandas.read_csv('data.csv')
train, test = train_test_split(csv, train_size=0.5)

UPDATE: train_test_split was moved to model_selection so the current way (scikit-learn 0.22.2) to do it is this:

import pandas
from sklearn.model_selection import train_test_split  # import the function from its submodule

csv = pandas.read_csv('data.csv')
train, test = train_test_split(csv, train_size=0.5)

To answer @desmond.carros's question, I modified the best answer as follows:

import random

data = []
with open("datafile.txt") as file:
    for line in file:
        data.append(line.split())  # pass your preferred delimiter to split()
random.shuffle(data)  # shuffle in place so the split is random
train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int((len(data)+1)*.80):]  # remaining 20% goes to the test set

The code splits the entire dataset into 80% training data and 20% test data.

from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()  # one example per line
data = numpy.array(data)  # convert the list to a NumPy array

x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 puts half of the data in the test set

sklearn.cross_validation has been deprecated since version 0.18. Instead, you should use sklearn.model_selection, as shown below:

from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()  # one example per line
data = numpy.array(data)  # convert the list to a NumPy array

x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 puts half of the data in the test set

You could also use NumPy. When your data is stored in a numpy.ndarray:

import numpy as np
from random import sample
n = len(data)  # length of data (100 in the question)
f = 50  # number of elements you need
indices = sample(range(n), f)  # f distinct random row indices

train_data = data[indices]
test_data = np.delete(data, indices, axis=0)  # all rows not chosen for training

A quick note for the answer from @subin sahayam

import random
for line in file:
    data.append(line.split())  # pass your preferred delimiter to split()
train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int(len(data)*.80+1):]  # remaining data goes to the test set

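If your list size is an even number, you should not add the 1 in the line below, because it can drop an item: with 10 items, the training slice ends at index int(11*.80) = 8 while the test slice starts at index int(10*.80+1) = 9, so the item at index 8 ends up in neither set. Instead, check the size of the list first and then decide whether you need to add the 1.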

test_data = data[int(len(data)*.80+1):]
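
One way to sidestep the question of whether to add the 1 is to compute the split index once and reuse it for both slices, so that no item is skipped or duplicated. A minimal sketch, using made-up lists in place of the real data (the helper name `split_80_20` is hypothetical):

```python
def split_80_20(data):
    """Return (train, test) using one shared split index."""
    split = int(len(data) * 0.80)  # the same index bounds both slices
    return data[:split], data[split:]

# Made-up data of odd and even length, for illustration only.
for n in (9, 10):
    items = list(range(n))
    train_data, test_data = split_80_20(items)
    # Every item lands in exactly one of the two sets.
    assert train_data + test_data == items
```

Shuffle the list before calling the helper if the split should be random.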

The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

import random, math

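def k_fold(myfile, myseed=11109, k=3):
    # Load data
    with open(myfile) as f:
        data = f.readlines()

    # Shuffle input
    random.seed(myseed)
    random.shuffle(data)

    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))

    # Create one partition per fold
    train = {}
    test = {}
    for ii in range(k):
        start, end = ii * len_part, (ii + 1) * len_part
        test[ii] = data[start:end]
        # Everything outside the test slice; slicing also keeps duplicate lines
        train[ii] = data[:start] + data[end:]

    return train, test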