[machine-learning] Is there a rule-of-thumb for how to divide a dataset into training and validation sets?

Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?

I have mostly been using an 80% / 20% split of training and validation data, respectively, but I chose this division without any principled reason. Can someone with more experience in machine learning advise me?



If you have relatively little data, I suggest trying 70%, 80%, and 90% training splits and testing which gives the better result. With a 90% split there is a chance that the 10% held out for testing is so small that you get a poor (i.e. noisy) accuracy estimate.
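As a minimal sketch of that comparison (the synthetic dataset, the logistic-regression model, and the scikit-learn calls here are my own illustration, not part of the suggestion above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

for train_frac in (0.7, 0.8, 0.9):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, train_size=train_frac, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # With a 90/10 split the validation score rests on only 100 examples,
    # so the estimate itself is noisier.
    print(f"train={train_frac:.0%}  val_acc={model.score(X_val, y_val):.3f}")
```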


Well, you should think about one more thing.

If you have a really big dataset, like 1,000,000 examples, an 80/10/10 split may be unnecessary, because 10% = 100,000 examples may be far more than you need just to confirm that the model works well.

Maybe 99/0.5/0.5 is enough, because 5,000 examples can represent most of the variance in your data, and you can easily tell that the model works well based on those 5,000 examples in the test and dev sets.

Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
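A quick arithmetic check of the set sizes being discussed (plain Python, no libraries assumed):

```python
n = 1_000_000
for name, fractions in {"80/10/10": (0.80, 0.10, 0.10),
                        "99/0.5/0.5": (0.99, 0.005, 0.005)}.items():
    train, dev, test = (int(n * f) for f in fractions)
    print(name, train, dev, test)
# 80/10/10   -> 800000 100000 100000
# 99/0.5/0.5 -> 990000   5000   5000
```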


It all depends on the data at hand. If you have a considerable amount of data, then 80/20 is a good choice, as mentioned above. But if you do not, cross-validation with a 50/50 split might help you a lot more, and prevent you from creating a model that over-fits your training data.
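A hedged sketch of that idea, assuming scikit-learn (the dataset and model are placeholders): 2-fold cross-validation is exactly the 50/50 split in which each half serves once as training data and once as validation data, so no example is permanently lost to a hold-out set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# cv=2 trains on one half and validates on the other, then swaps the halves.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=2)
print(scores, scores.mean())
```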


Perhaps a 63.2% / 36.8% split is a reasonable choice. The reason is that if you had a total sample size n and wanted to randomly sample with replacement (a.k.a. re-sample, as in the statistical bootstrap) n cases out of the initial n, the probability of an individual case being selected in the re-sample would be approximately 0.632, provided that n is not too small, as explained here: https://stats.stackexchange.com/a/88993/16263

For a sample of n=250, the probability of an individual case being selected for a re-sample to 4 digits is 0.6329. For a sample of n=20000, the probability is 0.6321.
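Those figures follow from the exact probability 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.6321 as n grows; a few lines of Python confirm them:

```python
import math

for n in (250, 20_000):
    # Probability that a given case appears at least once in a bootstrap
    # re-sample of size n drawn with replacement from n cases.
    p = 1 - (1 - 1 / n) ** n
    print(n, round(p, 4))   # 250 -> 0.6329, 20000 -> 0.6321

print("limit:", round(1 - math.exp(-1), 4))  # 0.6321
```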


You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.

However, depending on the training/validation methodology you employ, the ratio may change. For example, if you use 10-fold cross-validation, you end up with a validation set of 10% at each fold.
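For instance, a small sketch (assuming scikit-learn) that shows the 10% validation share at each fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 dummy examples
for train_idx, val_idx in KFold(n_splits=10).split(X):
    print(len(train_idx), len(val_idx))  # 90 train / 10 validation per fold
```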

There has been some research into the proper ratio between the training set and the validation set:

The fraction of patterns reserved for the validation set should be inversely proportional to the square root of the number of free adjustable parameters.

In their conclusion they specify a formula:

Validation set (v) to training set (t) size ratio, v/t, scales like ln(N/h-max), where N is the number of families of recognizers and h-max is the largest complexity of those families.

What they mean by complexity is:

Each family of recognizer is characterized by its complexity, which may or may not be related to the VC-dimension, the description length, the number of adjustable parameters, or other measures of complexity.

Taking the first rule of thumb (i.e. the validation set should be inversely proportional to the square root of the number of free adjustable parameters), you can conclude that if you have 32 adjustable parameters, the square root of 32 is ~5.66, and the fraction should be 1/5.66 ≈ 0.177 (v/t). Roughly 17.7% should be reserved for validation and 82.3% for training.
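In code, that back-of-the-envelope calculation is just:

```python
import math

p = 32                               # number of free adjustable parameters
val_frac = 1 / math.sqrt(p)          # ~0.177, per the square-root rule
print(f"validation: {val_frac:.1%}, training: {1 - val_frac:.1%}")
# validation: 17.7%, training: 82.3%
```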


Last year, I took Prof. Andrew Ng's online machine learning course. His recommendation was:

Training: 60%

Cross validation: 20%

Testing: 20%
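One possible way to produce that 60/20/20 split, assuming scikit-learn (the two-step use of train_test_split is my own sketch, not from the course):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 60% for training, then split the remaining 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```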