How to split your data

In supervised machine learning, you usually need three kinds of data sets:

  • train data: to teach the model the relation between data and labels
  • dev data: (short for development) to tune meta parameters of your model, e.g. number of neurons, batch size or learning rate.
  • test data: to evaluate your model ONCE at the end to check on generalization

Of course all this is to prevent overfitting on your train and/or dev data.

If you've used your test data for a while, you might need to find a new set, as chances are high that you overfitted on your test during experiments.

So what's a good split?

Some rules apply:

  • train and dev can be from the same set, but the test set is ideally from a different database.
  • if you don't have so much data, a 80/20/20 % split is normal
  • if you have masses an data, use only so much dev and test that your population seems covered.
  • If you have really little data: use x cross validation for train and dev, still the test set should be extra

Leave a Reply

Your email address will not be published. Required fields are marked *