[python] How do I create test and train samples from one dataframe with pandas?

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

This question is related to python python-2.7 pandas dataframe

The answer is


To split into more than two classes such as train, test, and validation, one can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validatoin_mask = probs >= 0.85


df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validatoin_mask]

This will put approximately 70% of data in training, 15% in test, and 15% in validation.


Pandas random sample will also work

train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)

No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

And if you want to split the whole df

X, y = df[list_of_x_cols], df[y_col]

There are many ways to create a train/test and even validation samples.

Case 1: classic way train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.

from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:

from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).

from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)

There are many valid answers. Adding one more to the bunch. from sklearn.cross_validation import train_test_split

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]

scikit learn's train_test_split is a good one - it will split both numpy arrays as dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

Just select range row from df like this

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]

If you need to split your data with respect to the lables column in your data set you can use this:

def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print '\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df))
        train_df = train_df.append(lbl_train_df)
        test_df = test_df.append(lbl_test_df)

    return train_df, test_df

and use it:

train, test = split_to_train_test(data, 'class', 0.7)

you can also pass random_state if you want to control the split randomness or use some global random seed.


You can use below code to create test and train samples :

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

Test size can vary depending on the percentage of data you want to put in your test and train dataset.


shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]

You can make use of df.as_matrix() function and create Numpy-array and pass it.

Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)

In my case, I wanted to split a data frame in Train, test and dev with a specific number. Here I am sharing my solution

First, assign a unique id to a dataframe (if already not exist)

import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split numbers:

train = 120765
test  = 4134
dev   = 2816

The split function

def df_split(df, n):
    
    first  = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace = True)
    second.reset_index(drop=True, inplace = True)
    return first, second

Now splitting into train, test, dev

train, test = df_split(df, 120765)
test, dev   = df_split(test, 4134)

You may also consider stratified division into training and testing set. Startified division also generates training and testing set randomly but in such a way that original class proportions are preserved. This makes training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.


A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.

def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r

import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

target_attribute = data['column_name']

X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)

You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

How about this? df is my dataframe

total_size=len(df)

train_size=math.floor(0.66*total_size) (2/3 part of my dataset)

#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)

There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]

I would use scikit-learn's own training_test_split, and generate it from the index

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train

you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe

 import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)

If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

def split_data(df, train_perc = 0.8):

   df['train'] = np.random.rand(len(df)) < train_perc

   train = df[df.train == 1]

   test = df[df.train == 0]

   split_data ={'train': train, 'test': test}

   return split_data

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = list(set(tot_ix) ^ set(test_ix))

    test_df = data_df.ix[test_ix]
    train_df = data_df.ix[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()

I think you also need to a get a copy not a slice of dataframe if you wanna add columns later.

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe