[python] Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)   

However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.

Tags: python, scikit-learn

The answers are below.


In addition to the accepted answer by @Andreas Mueller, I just want to add that, as @tangy mentioned above:

StratifiedShuffleSplit most closely resembles train_test_split(stratify=y), with the added features that:

  1. it stratifies by default, and
  2. by specifying n_splits, it repeatedly splits the data (see the sketch below)
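
For instance, a quick sketch of the repeated-split behaviour (the toy X and y here are just for illustration):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# n_splits=3 yields three independent, stratified 75%/25% splits
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)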

You can simply do it with the train_test_split() method available in scikit-learn:

from sklearn.model_selection import train_test_split

# stratify on the label column; 'YOUR_COLUMN_LABEL' is the class column in X
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])

I have also prepared a short GitHub Gist which shows how the stratify option works:

https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9


Updating @tangy's answer from above to the current version of scikit-learn: 0.23.2 (see the StratifiedShuffleSplit documentation).

from sklearn.model_selection import StratifiedShuffleSplit

n_splits = 1  # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
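
Note that split() returns positional indices, so if X and y are pandas objects the loop should use .iloc instead:

# with pandas DataFrames/Series, use .iloc since split() yields positional indices
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]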

It is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
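
For example, a minimal sketch (assuming X and y hold your features and class labels):

from sklearn.model_selection import train_test_split

# 75%/25% split that preserves the class proportions of y in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)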


# The final train share is roughly 1 - tst_size - vld_size (the second
# split's test fraction is taken from the remainder, not the full data)
tst_size = 0.15
vld_size = 0.15

# first carve out the validation set, stratifying on the target column
# (this assumes the target column in df is named 'y')
X_train_test, X_valid, y_train_test, y_valid = train_test_split(
    df.drop('y', axis=1), df['y'], test_size=vld_size,
    stratify=df['y'], random_state=13903)

# then split the remainder into train and test, again stratified
X_train, X_test, y_train, y_test = train_test_split(
    X_train_test, y_train_test, test_size=tst_size,
    stratify=y_train_test, random_state=13903)
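
To sanity-check that the stratification held, you can compare class proportions across the three sets (a small sketch, assuming the targets are pandas Series):

# class proportions should be approximately equal across train/test/valid
for name, part in [("train", y_train), ("test", y_test), ("valid", y_valid)]:
    print(name, part.value_counts(normalize=True).round(3).to_dict())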

TL;DR: Use StratifiedShuffleSplit with test_size=0.25

Scikit-learn provides two modules for Stratified Splitting:

  1. StratifiedKFold: This module is useful as a direct k-fold cross-validation operator: it will set up n_folds training/testing sets such that classes are equally balanced in both.

Here's some code (directly from the documentation):

>>> from sklearn import cross_validation  # pre-0.18 API; see the note after these snippets
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2)  # 2-fold cross-validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    # fit and predict with X_train/X_test; use accuracy metrics to check validation performance
  2. StratifiedShuffleSplit: This module creates a single training/testing set having equally balanced (stratified) classes. Essentially, this is what you want with n_iter=1. You can specify the test size here, the same as in train_test_split.

Code:

>>> from sklearn.cross_validation import StratifiedShuffleSplit  # pre-0.18 API
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test
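
Note that both snippets above use the pre-0.18 sklearn.cross_validation API. For reference, here is a sketch of the equivalents with the sklearn.model_selection API (0.18+), where y is passed to split() rather than to the constructor:

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# 2-fold stratified cross-validation (n_folds became n_splits)
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# a single stratified shuffle split (n_iter became n_splits)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]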

Here's an example for continuous/regression data (until this issue on GitHub is resolved).

import numpy as np
from sklearn.model_selection import train_test_split

# avoid shadowing the built-in min/max
y_min = np.amin(y)
y_max = np.amax(y)

# 5 bins may be too few for larger datasets.
bins     = np.linspace(start=y_min, stop=y_max, num=5)
y_binned = np.digitize(y, bins, right=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    stratify=y_binned
)
  • Where start is the minimum and stop is the maximum of your continuous target.
  • If you don't set right=True, then it will more or less make your maximum value a separate bin, and your split will always fail because too few samples will be in that extra bin.
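
Alternatively, equal-frequency (quantile) bins sidestep the sparse-edge-bin problem entirely; here is a sketch using pandas, assuming y fits in a Series:

import pandas as pd
from sklearn.model_selection import train_test_split

# 5 quantile bins: each holds ~20% of samples, so no bin is too sparse
# for stratification; duplicates='drop' merges tied bin edges
y_binned = pd.qcut(pd.Series(y), q=5, labels=False, duplicates='drop')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y_binned)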