How to split your data

In supervised machine learning, you usually need three kinds of data sets:

  • train data: to teach the model the relation between data and labels
  • dev data (short for development): to tune the meta-parameters (hyperparameters) of your model, e.g. the number of neurons, the batch size, or the learning rate.
  • test data: to evaluate your model ONCE at the end to check generalization

Of course, all of this serves to prevent overfitting to your train and/or dev data.

If you've used your test data for a while, you might need to find a new set, as chances are high that you have overfitted to your test set during your experiments.
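As a minimal sketch of such a three-way split (assuming a generic scikit-learn-style workflow, not Nkululeko's own API), you can call `train_test_split` twice: once to set aside the test set, and once to divide the rest into train and dev:

```python
from sklearn.model_selection import train_test_split

# toy placeholder data: 100 samples with four hypothetical emotion classes
samples = list(range(100))
labels = [i % 4 for i in range(100)]

# first split off the test set (20 %), stratified by label
rest_x, test_x, rest_y, test_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=42)

# then split the remaining 80 % into train and dev;
# 0.25 of 80 % is 20 % overall, giving a 60/20/20 split
train_x, dev_x, train_y, dev_y = train_test_split(
    rest_x, rest_y, test_size=0.25, stratify=rest_y, random_state=42)

print(len(train_x), len(dev_x), len(test_x))  # 60 20 20
```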

So what's a good split?

Some rules apply:

  • train and dev data can come from the same set, but the test set should ideally come from a different database.
  • if you don't have much data, a 60/20/20 % split is common.
  • if you have masses of data, use only as much dev and test data as you need to cover your population.
  • if you have very little data: use x-fold cross-validation for train and dev, but the test set should still be kept separate.
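For speech data, a speaker split keeps all samples of one speaker in the same partition, so the model can't cheat by recognizing voices. A sketch using scikit-learn's `GroupShuffleSplit` with hypothetical speaker IDs (this is not Nkululeko's own implementation):

```python
from sklearn.model_selection import GroupShuffleSplit

# hypothetical samples and their speaker IDs: 5 speakers, 4 samples each
samples = [f"wav_{i:03d}" for i in range(20)]
speakers = [f"spk_{i // 4}" for i in range(20)]

# hold out ~40 % of the speakers for dev, keeping speakers disjoint
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, dev_idx = next(splitter.split(samples, groups=speakers))

train_spk = {speakers[i] for i in train_idx}
dev_spk = {speakers[i] for i in dev_idx}
print(train_spk.isdisjoint(dev_spk))  # True: no speaker leaks across the split
```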

Nkululeko exercise

Edit the demo configuration

1)
Set (or keep) emotion as the target, os as the FEAT type, and xgb as the MODEL type.

Use emodb as both train and test set, but try out all split methods:

  • specified
  • speaker split
  • random
  • loso
  • logo
  • 5_fold_cross_validation
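In the configuration file, the split method is set per database in the [DATA] section. A sketch for the speaker split (the path is a placeholder, and the exact option names should be checked against the Nkululeko documentation for your version):

```ini
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
; percentage of speakers assigned to the test split (name assumed)
emodb.testsplit = 40
```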

Which works best and why?

2)
Set the following options:

[EXP]
epochs = 200
[MODEL] 
type = mlp
layers = {'l1':1024, 'l2':64} 
save = True
[PLOT]
epoch_progression = True
best_model = True

Run the experiment.
Find the epoch progression plot and see at which epoch overfitting starts.
