In supervised machine learning, you usually need three kinds of data sets:
- train data: to teach the model the relation between data and labels
- dev data (short for development data): to tune your model's hyperparameters, e.g. the number of neurons, the batch size, or the learning rate
- test data: to evaluate your model ONCE at the end to check generalization
Of course, all of this is meant to prevent overfitting on your train and/or dev data.
If you've been using your test data for a while, you may need to find a new set, since chances are high that you've overfitted on your test set during your experiments.
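A minimal sketch of such a three-way split in plain Python (the function name and the 80/10/10 fractions are illustrative, not prescribed by the text):

```python
import random

def train_dev_test_split(data, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a dataset and split it into train/dev/test portions."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Shuffling before splitting matters: if the data is sorted (e.g. by label or date), a straight slice would give the three sets very different distributions.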
So what's a good split?
Some rules apply:
- train and dev data can come from the same set, but the test set should ideally come from a different database.
- if you don't have much data, an 80/10/10 % split is common
- if you have masses of data, use only as much dev and test data as you need to cover the population
- if you have really little data: use k-fold cross-validation for train and dev; the test set should still be kept separate
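The last rule, k-fold cross-validation, can be sketched in plain Python as follows (the helper name and the choice of k are illustrative; the held-out test set would be set aside before calling this):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, dev_idx) pairs for k-fold cross-validation
    over a dataset of n items: each fold serves as dev set once."""
    # distribute n items over k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, dev_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, dev_idx in folds:
    print(len(train_idx), len(dev_idx))  # 8 2, five times
```

Every item is used for training in k-1 folds and for development in exactly one, so even a small dataset yields k dev evaluations instead of one.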