Nkululeko: how to use train/dev/test splits

Supervised machine learning operates as follows: during the training phase, a learning algorithm is adapted to a training dataset, producing a trained model, which is then used to make predictions on a test set during the inference phase.

One potential issue with this approach is that a sufficiently complex model may simply memorise all items in the training set rather than learn a generalised distinction based on an underlying process, such as emotional expression or speaker age. The model then performs well on the training data but fails to generalise to new data, a phenomenon known as overfitting.

To mitigate this, the model's hyperparameters are optimised using a held-out evaluation set that is not used during training. One particularly important hyperparameter is the number of epochs—that is, the number of times the entire training set is processed. Typically, to prevent overfitting, training is halted when performance on the evaluation set begins to decline, a technique known as early stopping. The model that performs best on the evaluation data is then selected.
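The early-stopping rule described above can be sketched in a few lines of Python. This is only an illustration, not Nkululeko's implementation: the function name, the scores and the exact meaning of `patience` (stop after that many epochs without improvement) are assumptions.

```python
def early_stopping(dev_scores, patience):
    """Return (best_epoch, stop_epoch) for per-epoch dev scores.

    Higher scores are better; epochs are counted from 1.
    Assumption: training stops once `patience` epochs have passed
    without a new best score on the dev set.
    """
    best_score = float("-inf")
    best_epoch = 0
    stop_epoch = len(dev_scores)
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch


# Hypothetical dev-set scores (e.g. average recall) per epoch.
dev_scores = [0.50, 0.58, 0.63, 0.61, 0.62, 0.60, 0.59]
best_epoch, stop_epoch = early_stopping(dev_scores, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

The model from `best_epoch` is the one that would be kept; the epochs after it only serve to confirm that the dev score no longer improves.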

However, this approach introduces a new problem: the model may have (and most likely has) now overfitted to the evaluation data. This is why a third dataset is necessary for final testing, one that has not been used at any stage of model development.

The evaluation set is often referred to as the dev set (short for development set). Consequently, Nkululeko now provides support for three distinct data splits: train, dev, and test.

Here is an example of how you would do this with emoDB (the distribution has no predefined splits for train, dev and test):

[EXP]
root = ./experiments/emodb_3split/
name = results
epochs = 100
traindevtest = True
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ["neutral", "sadness", "happiness"]
target = emotion
[FEATS]
type = ['os']
[MODEL]
type = mlp
layers = {'l1':100, 'l2':16}
patience = 10
[PLOT]
best_model = True
epoch_progression = True

You trigger the handling of three splits with

traindevtest = True

and the rest happens automatically in this case. The results are then shown for

the best model based on the development set:

the best model for the dev set, but evaluated on the test set:

and the last model, evaluated on the dev set:

In this case, you can see that the 62nd epoch performed like the 52nd on the dev set. However, when this best model is evaluated on the test set, it drops by more than 20 % in average recall, which is a more reliable estimate of the general performance of this model. (This is only a toy example with 4 speakers in the training set and 2 each in the dev and test sets.)
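To make the idea of a speaker-disjoint three-way split concrete, here is a minimal Python sketch. It is not how Nkululeko's speaker_split is implemented; the function name, the speaker ids and the fixed seed are hypothetical, and the sizes (4 train, 2 dev, 2 test speakers) simply mirror the toy example above.

```python
import random


def speaker_disjoint_split(speakers, n_dev, n_test, seed=0):
    """Assign each speaker to exactly one of train, dev or test."""
    unique = sorted(set(speakers))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    test = set(unique[:n_test])
    dev = set(unique[n_test:n_test + n_dev])
    train = set(unique[n_test + n_dev:])
    return train, dev, test


# Hypothetical speaker ids, one per utterance (eight distinct speakers).
utterance_speakers = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s1", "s3"]
train, dev, test = speaker_disjoint_split(utterance_speakers, n_dev=2, n_test=2)
print(len(train), len(dev), len(test))  # 4 2 2
```

Because no speaker appears in more than one set, a good dev or test score cannot come from the model recognising a voice it has already heard in training.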

Nkululeko: predict speaker id

Since version 0.93.0, nkululeko interfaces with the pyannote segmentation package (as an alternative to silero).

There are two modules that you can use for this:

  • SEGMENT
  • PREDICT

The (huge) difference is that the SEGMENT module looks at each file in the input data and searches for speakers per file (the input may be just one large file), while the PREDICT module concatenates all input data and searches for different speakers in the whole database.

In any case, it is best to run it on a GPU, as CPU will be very slow (and there is no progress bar).

Segment module

If you specify the method in the [SEGMENT] section and the hf_token (needed for the pyannote model) in the [MODEL] section

[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all
[MODEL]
hf_token = <my hugging face token>

your resulting segmentations will have the predicted speaker id attached. Be aware that this is really slow on CPU, so it is best to run on GPU and declare so in the [MODEL] section:

[MODEL]
hf_token = <my hugging face token>
device=gpu # or cuda:0

As a result, a new plot appears in the image folder: the distribution of speakers that were found, e.g. like this:

Predict module

Simply select speaker as the prediction target:

[PREDICT]
targets = ["speaker"]

Generally, the PREDICT module is described here

Nkululeko: ensemble learners with late fusion

Since version 0.88.0, nkululeko lets you combine experiment results and report on the outcome by using the ensemble module.

For example, you might like to know whether the combination of expert features and learned embeddings works better than either of them alone. You could then run

python -m nkululeko.ensemble \
--method max_class \
tests/exp_emodb_praat_xgb.ini \
tests/exp_emodb_ast_xgb.ini \
tests/exp_emodb_wav2vec_xgb.in

(all in one line) and would then get the fused result of the three experiments for Praat, AST and Wav2vec2 features, combined with the method passed via --method.

The available methods are majority_voting, mean, max, sum, max_class, uncertainty_threshold, uncertainty_weighted and confidence_weighted:

  • majority_voting: For classification: predict the category that most classifiers agree on (the mode of the predictions).
  • mean: For classification: compute the arithmetic mean of the probabilities from all predictors for each label, then use the highest value to infer the label.
  • max: For classification: use the maximum of the probabilities from all predictors for each label, then use the highest value to infer the label.
  • sum: For classification: use the sum of the probabilities from all predictors for each label, then use the highest value to infer the label.
  • max_class: For classification: compare the highest probabilities of all models across classes (instead of within the same class, as in max) and return the highest probability and its class.
  • uncertainty_threshold: For classification: predict the class with the lowest uncertainty if it is lower than a threshold (defaults to 1.0, meaning no threshold); otherwise calculate the mean of the uncertainties of all models per class and predict the class with the lowest mean.
  • uncertainty_weighted: For classification: weight each class with the inverse of its uncertainty (1/uncertainty), normalize the weights per model, then multiply each model's class probabilities by the normalized weights and use the maximum to infer the label.
  • confidence_weighted: Weighted ensemble based on confidence (1-uncertainty), normalized over all samples per model. Like uncertainty_weighted, but with confidence (instead of the inverse of uncertainty) as weights.
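The simpler fusion rules from the list can be illustrated for a single sample with plain Python. This is a sketch following the descriptions above, not the code of the ensemble module; the helper names and the probabilities are made up for the example.

```python
from collections import Counter

LABELS = ["neutral", "sadness", "happiness"]


def argmax_label(scores):
    """Return the label whose score is highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]


def majority_voting(probs):
    """Each model votes for its top class; the most frequent vote wins."""
    votes = [argmax_label(p) for p in probs]
    return Counter(votes).most_common(1)[0][0]


def fuse(probs, combine):
    """Combine the models' probabilities class by class, then pick the top class."""
    per_class = [combine(column) for column in zip(*probs)]
    return argmax_label(per_class)


def mean(values):
    return sum(values) / len(values)


# Hypothetical probabilities for ONE sample from three models
# (rows: models, columns: classes in the order of LABELS).
probs = [
    [0.45, 0.40, 0.15],
    [0.45, 0.40, 0.15],
    [0.05, 0.90, 0.05],
]

print(majority_voting(probs))  # neutral
print(fuse(probs, mean))       # sadness
print(fuse(probs, max))        # sadness
print(fuse(probs, sum))        # sadness
```

In this example two models weakly prefer neutral, so majority voting picks neutral, while the probability-based rules follow the one model that is very sure about sadness.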

Nkululeko: export acoustic features

With nkululeko since version 0.85.0 the acoustic features for the test and the train (aka dev) set are exported to the project store.

If you specify the store_format:

[FEATS]
store_format = csv

they will be exported to CSV (comma-separated values) files, otherwise to PKL (readable by the Python pickle module).
That is, after running any nkululeko module that computes features, your store should contain these two files:

  • feats_test.csv
  • feats_train.csv

If you specified scaling the features:

[FEATS]
scale = standard # or speaker

you will have two additional files with features:

  • feats_test_scaled.csv
  • feats_train_scaled.csv

In contrast to the other feature stores, these contain the exact features that are used for training or feature importance exploration, so they might be combined from different feature types and selected via the features value. An example:

[FEATS]
type = ['praat', 'os']
features = ['speechrate_nsyll_dur', 'F0semitoneFrom27.5Hz_sma3nz_amean']
scale = standard
store_format = csv

results in the following feats_test.csv:

file,start,end,speechrate_nsyll_dur,F0semitoneFrom27.5Hz_sma3nz_amean
./data/emodb/emodb/wav/11b03Wb.wav,0 days,0 days 00:00:05.213500,4.028004219813945,34.42206
./data/emodb/emodb/wav/16b10Td.wav,0 days,0 days 00:00:03.934187500,3.0501850763340586,31.227554

....
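An exported file of this shape can be read back with Python's standard csv module. The sketch below inlines the two rows shown above so that it is self-contained; with a real export you would open feats_test.csv from your store folder instead. The helper name load_features is hypothetical.

```python
import csv
import io

# The rows of feats_test.csv shown above, inlined for a self-contained example.
CSV_TEXT = """file,start,end,speechrate_nsyll_dur,F0semitoneFrom27.5Hz_sma3nz_amean
./data/emodb/emodb/wav/11b03Wb.wav,0 days,0 days 00:00:05.213500,4.028004219813945,34.42206
./data/emodb/emodb/wav/16b10Td.wav,0 days,0 days 00:00:03.934187500,3.0501850763340586,31.227554
"""

# The first three columns identify the segment; the rest are feature values.
INDEX_COLUMNS = ("file", "start", "end")


def load_features(handle):
    """Return a list of (file, {feature_name: value}) pairs."""
    rows = []
    for row in csv.DictReader(handle):
        feats = {k: float(v) for k, v in row.items() if k not in INDEX_COLUMNS}
        rows.append((row["file"], feats))
    return rows


features = load_features(io.StringIO(CSV_TEXT))
for file, feats in features:
    print(file, feats)
```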

Nkululeko: how to finetune a transformer model

With nkululeko since version 0.85.0 you can finetune a transformer model with huggingface (and even publish it there if you like).

If you like to have your model published, set:

[MODEL]
push_to_hub = True

Finetuning in this context means to train the (pre-trained) transformer layers with your new training data labels, as opposed to only using the last layer as embeddings.

The only thing you need to do is to set your MODEL type to finetune:

[FEATS]
type = []
[MODEL]
type = finetune

The acoustic features can/should be empty, because the transformer model starts with CNN layers that model the acoustics frame-wise. The frames are then pooled by the model for the whole utterance (maximum duration: the first 8 seconds; the rest is ignored).

The default base model is the one from facebook, but you can specify a different one like this:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base

max_duration = 10.5

The parameter max_duration is also optional (default=8) and means the maximum duration of your samples / segments (in seconds) that will be used, starting from 0. The rest is disregarded.
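What "the rest is disregarded" means for a single sample can be sketched as follows. How the library cuts the audio internally is not shown in this post, so the function below is only an assumption-level illustration with a hypothetical name and a made-up signal.

```python
def truncate(samples, sampling_rate, max_duration=8.0):
    """Keep only the first max_duration seconds of a signal."""
    max_samples = int(max_duration * sampling_rate)
    return samples[:max_samples]


# Hypothetical 10-second signal at 16 kHz (all zeros, for illustration only).
sampling_rate = 16000
signal = [0.0] * (10 * sampling_rate)

kept = truncate(signal, sampling_rate)
print(len(kept) / sampling_rate)  # 8.0
```

Samples shorter than the limit are returned unchanged, since slicing past the end of a list is harmless in Python.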

You can use the usual deep learning parameters:

[MODEL]
learning_rate = .001
batch_size = 16
device = cuda:3
measure = mse
loss = mse

but all of them have defaults.

The loss function is fixed to

  • weighted cross entropy for classification
  • concordance correlation coefficient for regression
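For readers unfamiliar with the concordance correlation coefficient (CCC), here is a small pure-Python version of the standard formula. Turning it into a loss as 1 - CCC is a common convention and an assumption here; the post does not show how the library computes it.

```python
def ccc(truth, pred):
    """Concordance correlation coefficient of two equally long sequences.

    Note: undefined (division by zero) if both sequences are constant and equal.
    """
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def ccc_loss(truth, pred):
    """Assumed loss form: 0 for perfect agreement, larger for worse agreement."""
    return 1.0 - ccc(truth, pred)


truth = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated, but offset by 1

print(ccc(truth, truth))    # 1.0
print(ccc(truth, shifted))  # below 1: CCC penalises the offset
```

Unlike the Pearson correlation, the CCC drops when predictions are systematically shifted or scaled, which is why it is popular for regression targets such as arousal or valence.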

The resulting best model and the huggingface logs (which can be read by tensorboard) are stored in the project folder.

How to use train, dev and test splits with Nkululeko

Usually in machine learning, you train your predictor on a train set, tune meta-parameters on a dev set (development or validation set) and evaluate on a test set.
With nkululeko, there is currently no separate test set, as only two sets can be specified: the train and the evaluation set.
A workaround is to use the test module to evaluate your best model on a held-out test set at the end of your experiments.
All you need to do is to specify the name of the test data in your [DATA] section, like so (let's call it myconf.ini):

[EXP]
save = True
....
[DATA]
databases =  ['my_train-dev_data']
... 
tests = ['my_test_data']
my_test_data = ./data/my_test_data/
my_test_data.split_strategy = test
...

You can run the experiment module with your config:

python -m nkululeko.nkululeko --config myconf.ini

and then, after optimization (of predictors, feature sets and meta-parameters), use the test module:

python -m nkululeko.test --config myconf.ini

The results will appear at the same place as all other results, but the files are named with test and the test database as a suffix.

If you need to compare several predictors and feature sets, you can use the nkuluflag module.
All you need to do in your main script, when you call the nkuluflag module, is to pass a parameter (named --mod) that tells it to use the test module:

cmd = 'python -m nkululeko.nkuluflag --config myconf.ini  --mod test '

Nkululeko: how to tweak the target variable for database comparison

Sometimes you want to compare two different databases that share a similar target variable, say, related to likability, but with different scaling: for example, one was rated on a scale from 1 to 10 and the other used a Likert scale from 1 to 7.

With nkululeko you can rename labels, normalize the target values, and even invert the polarity for each database.

In the following example there are two databases:

[DATA]
databases = ['db1', 'db2']
db1.split_strategy = test
db1.scale = True
db2.colnames = {'non-attractive':'likability'}
db2.split_strategy = train
db2.scale = True
db2.reverse = True
db2.reverse.max = 10
target = likability
bins = [-1000, .2, 1000]
labels = ['less likable', 'more likable']

The first database, db1, already has a likability label and just needs to be standard-normalized. The second one, db2, has a related label non-attractive, which needs to be renamed, inverted (based on a hypothetical maximum value of 10) and normalized.
Then, db1 can be used as test data and db2 as training data.
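The three steps applied to db2 (invert, normalize, bin) can be sketched in Python as follows. This is a guess at the arithmetic, not Nkululeko's code: inverting as max - value, using the population standard deviation, and treating bins as right-inclusive intervals are all assumptions, and the ratings are invented.

```python
import statistics


def reverse(values, max_value):
    """Invert polarity (assumed here as max_value - value)."""
    return [max_value - v for v in values]


def standardise(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def to_label(value, bins, labels):
    """Map a value to the label of the interval (bins[i], bins[i+1]] it falls into."""
    for lower, upper, label in zip(bins, bins[1:], labels):
        if lower < value <= upper:
            return label
    raise ValueError(f"{value} is outside the bins")


BINS = [-1000, 0.2, 1000]
LABELS = ["less likable", "more likable"]

# Hypothetical 'non-attractive' ratings from db2 on a 1-10 scale.
non_attractive = [2, 9, 5, 7]
likability = standardise(reverse(non_attractive, max_value=10))
print([to_label(v, BINS, LABELS) for v in likability])
# ['more likable', 'less likable', 'more likable', 'less likable']
```

After these steps both databases share the same label name, polarity and value range, so the same bins can be applied to each.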

Nkululeko: how to bin/discretize your feature values

Since version 0.77.8, nkululeko lets you convert all feature values into the discrete classes low, mid and high.

Simply state

[FEATS]
type = ['praat']
scale = bins
store_format = csv

in your config to use Praat features.
With the store format stated as csv you will be able to look at the train and test features in the store folder.

The binning is based on the 33rd and 66th percentiles of the training feature values.
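As an illustration of this kind of binning, the following Python sketch derives the two bin edges from hypothetical training values and applies them to test values. The percentile method (linear interpolation) and the handling of values that fall exactly on an edge are assumptions, not taken from the Nkululeko source.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in 0..100) of an already sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def fit_bins(train_values):
    """Compute the bin edges from the TRAINING values only."""
    ordered = sorted(train_values)
    return percentile(ordered, 33), percentile(ordered, 66)


def to_bin(value, low_edge, high_edge):
    if value < low_edge:
        return "low"
    if value < high_edge:
        return "mid"
    return "high"


# Hypothetical values of one feature in the training and test sets.
train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
test = [2.5, 5.0, 9.5]

low_edge, high_edge = fit_bins(train)
print([to_bin(v, low_edge, high_edge) for v in test])  # ['low', 'mid', 'high']
```

Fitting the edges on the training data and reusing them for the test data keeps information about the test set out of the preprocessing.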