Nkululeko: how to use train/dev/test splits

Supervised machine learning operates as follows: during the training phase, a learning algorithm is adapted to a training dataset, producing a trained model, which is then used to make predictions on a test set during the inference phase.

One potential issue with this approach is that a sufficiently complex model may simply memorise all items in the training set rather than learning a generalised distinction based on an underlying process, such as emotional expression or speaker age. This means that while the model performs well on the training data, it fails to generalise to new data, a phenomenon known as overfitting.

To mitigate this, the model's hyperparameters are optimised using a held-out evaluation set that is not used during training. One particularly important hyperparameter is the number of epochs—that is, the number of times the entire training set is processed. Typically, to prevent overfitting, training is halted when performance on the evaluation set begins to decline, a technique known as early stopping. The model that performs best on the evaluation data is then selected.
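The core of this early-stopping logic can be sketched in a few lines of generic Python (this is not Nkululeko's internal code; the iris data and the small MLP are just placeholders, and the patience value plays the same role as the patience parameter in the configuration further down):

import copy
import warnings
from sklearn.datasets import load_iris
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# placeholder data standing in for acoustic features and emotion labels
X, y = load_iris(return_X_y=True)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=42)

# warm_start=True with max_iter=1 means each call to fit() trains one more epoch
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1, warm_start=True, random_state=42)

best_uar, best_model, bad_epochs = -1.0, None, 0
patience, max_epochs = 10, 100

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the per-epoch convergence warnings
    for epoch in range(max_epochs):
        model.fit(X_train, y_train)  # one more pass over the training set
        uar = recall_score(y_dev, model.predict(X_dev), average="macro")  # unweighted average recall on the evaluation set
        if uar > best_uar:
            best_uar, best_model, bad_epochs = uar, copy.deepcopy(model), 0  # remember the best model so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping: evaluation performance stopped improving

model = best_model  # keep the epoch that performed best on the evaluation set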

However, this approach introduces a new problem: the model may (and most likely has) now overfitted to the evaluation data. This is why a third dataset is necessary for final testing—one that has not been used at any stage of model development.

The evaluation set is often referred to as the dev set (short for development set). Consequently, Nkululeko now provides support for three distinct data splits: train, dev, and test.
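When a database has no predefined partitions, they have to be created, ideally so that no speaker appears in more than one of them. As a rough, generic illustration of such a speaker-disjoint three-way split (this is only a sketch with scikit-learn's GroupShuffleSplit, not Nkululeko's own speaker_split implementation):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# toy speaker labels, one entry per sample
speakers = np.array(["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"])
indices = np.arange(len(speakers))

# first split off the test speakers, then split the remainder into train and dev
trainval_idx, test_idx = next(
    GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    .split(indices, groups=speakers))
train_rel, dev_rel = next(
    GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
    .split(trainval_idx, groups=speakers[trainval_idx]))
train_idx, dev_idx = trainval_idx[train_rel], trainval_idx[dev_rel]
# every speaker now ends up in exactly one of train_idx, dev_idx, test_idx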

Here is an example of how you would do this with emoDB (this distribution has no predefined train, dev, and test splits):

[EXP]
root = ./experiments/emodb_3split/
name = results
epochs = 100
traindevtest = True
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ["neutral", "sadness", "happiness"]
target = emotion
[FEATS]
type = ['os']
[MODEL]
type = mlp
layers = {'l1':100, 'l2':16}
patience = 10
[PLOT]
best_model = True
epoch_progression = True
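You would then run the experiment as usual; assuming the configuration above is stored in a file called exp_emodb_3split.ini (the name is of course arbitrary), with something like:

python3 -m nkululeko.nkululeko --config exp_emodb_3split.ini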

You trigger the handling of three splits with

traindevtest = True

and, in this case, the rest happens automatically. The results are then shown for three cases:

- the best model, selected on the dev set,
- this best model, but evaluated on the test set,
- and the last model, evaluated on the dev set.

In this case, you can see that the 62nd epoch performed about as well as the 52nd on the dev set. But when this best model is evaluated on the test set, the average recall drops by more than 20%, and this test-set result is the more reliable indicator of the model's general performance (bear in mind that this is only a toy example with 4 speakers in the training set and 2 speakers each in the dev and test sets).
