Considering that df
id your original dataframe:
1 - First you split data between Train and Test (10%):
my_test_size = 0.10
X_train_, X_test, y_train_, y_test = train_test_split(
df.index.values,
df.label.values,
test_size=my_test_size,
random_state=42,
stratify=df.label.values,
)
2 - Then you split the train set between train and validation (20%):
my_val_size = 0.20
X_train, X_val, y_train, y_val = train_test_split(
df.loc[X_train_].index.values,
df.loc[X_train_].label.values,
test_size=my_val_size,
random_state=42,
stratify=df.loc[X_train_].label.values,
)
3 - Then, you slice the original dataframe according to the indices generated in the steps above:
# data_type is not necessary.
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.loc[X_test, 'data_type'] = 'test'
The result is going to be like this:
Note: This soluctions uses the workaround mentioned in the question.