[machine-learning] Is there a rule-of-thumb for how to divide a dataset into training and validation sets?

Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?

I have mostly been using an 80% / 20% split of training and validation data, respectively, but I chose this division without any principled reason. Can someone with more experience in machine learning advise me?



If you have relatively little data, I suggest trying 70%, 80%, and 90% training splits and testing which gives the better result. With a 90% split there is a chance that the 10% held out for testing is so small that you get a poor (i.e. noisy) accuracy estimate.
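As a minimal sketch of that comparison (the synthetic dataset, the logistic-regression model, and the scikit-learn calls here are my own illustration, not part of the suggestion above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

for train_frac in (0.7, 0.8, 0.9):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, train_size=train_frac, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # With a 90/10 split the validation score rests on only 100 examples,
    # so the estimate itself is noisier.
    print(f"train={train_frac:.0%}  val_acc={model.score(X_val, y_val):.3f}")
```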


Well, you should think about one more thing.

If you have a really big dataset, like 1,000,000 examples, an 80/10/10 split may be unnecessary, because 10% = 100,000 examples may be far more than you need just to confirm that the model works well.

Maybe 99/0.5/0.5 is enough, because 5,000 examples can represent most of the variance in your data, and you can easily tell that the model works well based on those 5,000 examples in the test and dev sets.

Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
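A quick arithmetic check of the set sizes being discussed (plain Python, no libraries assumed):

```python
n = 1_000_000
for name, fractions in {"80/10/10": (0.80, 0.10, 0.10),
                        "99/0.5/0.5": (0.99, 0.005, 0.005)}.items():
    train, dev, test = (int(n * f) for f in fractions)
    print(name, train, dev, test)
# 80/10/10   -> 800000 100000 100000
# 99/0.5/0.5 -> 990000   5000   5000
```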


It all depends on the data at hand. If you have a considerable amount of data, then 80/20 is a good choice, as mentioned above. But if you do not, cross-validation with a 50/50 split might help you a lot more, and prevent you from creating a model that over-fits your training data.
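A hedged sketch of that idea, assuming scikit-learn (the dataset and model are placeholders): 2-fold cross-validation is exactly the 50/50 split in which each half serves once as training data and once as validation data, so no example is permanently lost to a hold-out set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# cv=2 trains on one half and validates on the other, then swaps the halves.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=2)
print(scores, scores.mean())
```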


Perhaps a 63.2% / 36.8% split is a reasonable choice. The reason is that if you had a total sample size n and wanted to randomly sample with replacement (a.k.a. re-sample, as in the statistical bootstrap) n cases out of the initial n, the probability of an individual case being selected in the re-sample would be approximately 0.632, provided that n is not too small, as explained here: https://stats.stackexchange.com/a/88993/16263

For a sample of n=250, the probability of an individual case being selected for a re-sample to 4 digits is 0.6329. For a sample of n=20000, the probability is 0.6321.
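Those figures follow from the exact probability 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.6321 as n grows; a few lines of Python confirm them:

```python
import math

for n in (250, 20_000):
    # Probability that a given case appears at least once in a bootstrap
    # re-sample of size n drawn with replacement from n cases.
    p = 1 - (1 - 1 / n) ** n
    print(n, round(p, 4))   # 250 -> 0.6329, 20000 -> 0.6321

print("limit:", round(1 - math.exp(-1), 4))  # 0.6321
```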


You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.

However, depending on the training/validation methodology you employ, the ratio may change. For example, if you use 10-fold cross-validation, you end up with a validation set of 10% at each fold.
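For instance, a small sketch (assuming scikit-learn) that shows the 10% validation share at each fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 dummy examples
for train_idx, val_idx in KFold(n_splits=10).split(X):
    print(len(train_idx), len(val_idx))  # 90 train / 10 validation per fold
```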

There has been some research into the proper ratio between the training set and the validation set:

The fraction of patterns reserved for the validation set should be inversely proportional to the square root of the number of free adjustable parameters.

In their conclusion they specify a formula:

Validation set (v) to training set (t) size ratio, v/t, scales like ln(N/h-max), where N is the number of families of recognizers and h-max is the largest complexity of those families.

What they mean by complexity is:

Each family of recognizer is characterized by its complexity, which may or may not be related to the VC-dimension, the description length, the number of adjustable parameters, or other measures of complexity.

Taking the first rule of thumb (i.e. the validation set should be inversely proportional to the square root of the number of free adjustable parameters), you can conclude that if you have 32 adjustable parameters, the square root of 32 is ~5.66, and the fraction should be 1/5.66 ≈ 0.177 (v/t). Roughly 17.7% should be reserved for validation and 82.3% for training.
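In code, that back-of-the-envelope calculation is just:

```python
import math

p = 32                               # number of free adjustable parameters
val_frac = 1 / math.sqrt(p)          # ~0.177, per the square-root rule
print(f"validation: {val_frac:.1%}, training: {1 - val_frac:.1%}")
# validation: 17.7%, training: 82.3%
```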


Last year, I took Prof. Andrew Ng's online machine learning course. His recommendation was:

Training: 60%

Cross validation: 20%

Testing: 20%
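One possible way to produce that 60/20/20 split, assuming scikit-learn (the two-step use of train_test_split is my own sketch, not from the course):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 60% for training, then split the remaining 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```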