All posts by felix

Nkululeko: how to compare classifiers, features and databases using multiple runs

Since version 0.98, nkululeko provides functionality to compare the outcomes of several runs across experiments.

Say you would like to know whether the difference between using acoustic (opensmile) features and linguistic embeddings (bert) as input for some classifier is significant. You could then use the outcomes of several runs of one MLP (multi-layer perceptron) as tests that represent all possible runs (disclaimer: as far as I know, this approach is disputed by some statisticians).

You would set up your experiment like this:

[EXP]
...
runs = 10
epochs = 100
[FEATS]
type = ['bert']
#type = ['os']
#type = ['os', 'bert']
[MODEL]
type = mlp
...
patience = 5
[EXPL]
# turn on extensive statistical output
print_stats = True
[PLOT]
runs_compare = features

and run this three times, each time changing the feature type that is used (bert, os, or the combination of both), so that in the end your results folder contains three different run_results text files.
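For reference, each of the three experiments is started as usual with the nkululeko module (the config file name here is just a placeholder):

python -m nkululeko.nkululeko --config exp_runs.ini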

Using this, nkululeko prints a plot that compares the three feature sets; here's an example (using only 5 runs):

The title states the overall significance across all differences, as well as the largest pair-wise one. If your number of runs is larger than 30, t-tests will be used instead of Mann-Whitney tests.

Nkululeko tutorial: voice of wellness workshop

Context

In September 2025, we held the Voice of Wellness workshop.

In this post I go through the nkululeko experiments I used for the tutorials there.

Prepare the Database

I use the Androids corpus (paper here).

First thing you should probably do is check the data formats and re-sample if necessary.

[RESAMPLE]
# which of the data splits to re-sample: train, test or all (both)
sample_selection = all
replace = True
target = data_resampled.csv
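The re-sampling itself is done with the resample module; assuming the config file used throughout this post is data/androids/exp.ini:

python -m nkululeko.resample --config data/androids/exp.ini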

Explore

Check the database distributions

python -m nkululeko.explore --config data/androids/exp.ini

Transcribe and translate

Transcribe the samples. Note: this should be done on a GPU.

Translate the transcriptions; no GPU required, as this uses a Google service.
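A hedged sketch of how this could look in the config; the target names below are my assumption (check the nkululeko documentation for the exact names the PREDICT module expects):

[PREDICT]
# "text" for the ASR transcription, "translation" for the Google-based translation (names assumed)
targets = ['text', 'translation']

python -m nkululeko.predict --config data/androids/exp.ini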

Segment

Androids database samples are quite long sometimes.
It makes sense to check if approaches work better on shorter speech segments.

python -m nkululeko.segment --config data/androids/exp.ini
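A minimal sketch of a matching [SEGMENT] section, re-using the keys shown in the pyannote post further down; the silero method value is an assumption based on the package name:

[SEGMENT]
# VAD-based segmentation of long recordings into shorter speech chunks
method = silero
segment_target = _segmented
sample_selection = all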

Filter the data

[DATA]
# use at most 8 samples per speaker
data.limit_samples_per_speaker = 8
# restrict the data to samples from the interview task
data.filter = [['task', 'interview']]
# check for a minimal sample size
check_size = 1000

Define splits

Either use pre-defined folds:

[MODEL]
# leave-one-group-out cross validation with 5 speaker-group folds
logo = 5

or, randomly define splits, but stratify them:

[DATA]
data.split_strategy = balanced
data.balance = {'depression':2, 'age':1, 'gender':1}
data.age_bins = 2

Add additional training data

More details here

[DATA]
databases = ['data', 'emodb']
data.split_strategy = speaker_split
# add German emotional data
emodb = ./data/emodb/emodb
# rename emotion to depression
emodb.colnames = {"emotion": "depression"}
# only use neutral and sad samples
emodb.filter = [["depression", ["neutral", "sadness"]]]
# map them to depression
emodb.mapping = {"neutral": "control", "sadness": "depressed"}
# and put everything to the training
emodb.split_strategy = train
target = depression
labels = ['depressed', 'control']

Nkululeko: how to align databases

Sometimes you might want to combine databases that are similar, but don't label exactly the same phenomenon.

Take, for example, stress and emotion: you don't have enough data labelled for stress, but many emotion databases that label anger and happiness. You might then try to use angry samples as stressed and happy or neutral ones as non-stressed.

Taking the usual emodb as an example, and the famous SUSAS as a database sampling stressed voices, you can do this as follows:

[DATA]
databases = ['emodb', 'susas']

emodb = ./data/emodb/emodb
# indicate where the target values are
emodb.target_tables = ["emotion"]
# rename emotion to stress
emodb.colnames = {"emotion": "stress"}
# only use angry, neutral and happy samples
emodb.filter = [["stress", ["anger", "neutral", "happiness"]]]
# map them to stress
emodb.mapping = {"anger": "stress",  "neutral": "no stress", "happiness": "no stress"}
# and put everything to the training
emodb.split_strategy = train

susas = data/susas/
# map ternary stress labels to binary
susas.mapping = {'0,1':'no stress', '2':'stress'}
susas.split_strategy = speaker_split

target = stress
labels = ["stress", "no stress"]

So SUSAS will be split into train and test, but the training set will be strengthened by the whole of emodb. This usually makes more sense if a third database is available for evaluation, because in-domain machine learning works better in most cases than adding out-of-domain data (as we do here with emodb).
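As a hedged sketch, such a third, out-of-domain database could be reserved entirely for evaluation; the name and path below are placeholders, and split_strategy = test is assumed to be the counterpart of the train strategy used above:

[DATA]
databases = ['emodb', 'susas', 'thirddb']
thirddb = ./data/thirddb/
thirddb.split_strategy = test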

Nkululeko: using uncertainty

Since version 0.94, nkululeko explicitly visualizes (aleatoric) uncertainty, i.e. how confident the model is in its predictions. You simply find a plot in the image folder after running an experiment, like so:

You see the distribution of uncertainty for true vs. false predictions; in this case it worked out quite well (because less uncertain predictions are usually correct).

The approach is described in our paper Uncertainty-Based Ensemble Learning For Speech Classification

You can use this to tweak your results by specifying an uncertainty threshold, i.e. you refuse to predict samples whose uncertainty is above that threshold:

[PLOT]
uncertainty_threshold = .4

You will then additionally get a confusion plot that only takes the selected samples into account.

This might feel like cheating, but especially in critical use cases it might be better to deliver no prediction at all than a wrong one.

Nkululeko: feature scaling

As described in this previous post, feature scaling can be quite important in machine learning.

Since version 0.97, nkululeko offers a multitude of scaling methods.

You simply state in the config:

[FEATS]
scale = xxx

For xxx, you can specify one of the following scaling methods:

  • standard: z-transformation (mean of 0 and std of 1) based on the training set
  • robust: robust scaler (centres on the median and scales by the interquartile range, so it is less sensitive to outliers)
  • speaker: like standard, but based on each individual speaker's samples (also for the test set)
  • bins: convert feature values into 0, .5 and 1 (for low, mid and high)
  • minmax: rescales the data set such that all feature values are in the range [0, 1]
  • maxabs: similar to minmax, but scales each feature by its maximum absolute value, so values end up in the range [-1, 1] (useful when the data contains negative values)
  • normalizer: scales each sample (row) individually to have unit norm (e.g., L2 norm)
  • powertransformer: applies a power transformation to each feature to make the data more Gaussian-like, in order to stabilize variance and minimize skewness
  • quantiletransformer: applies a non-linear transformation such that the probability density function of each feature is mapped to a uniform distribution (range [0, 1]) or a Gaussian distribution
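For example, to z-normalise openSMILE features based on the training set (the feature type here is only an example):

[FEATS]
type = ['os']
scale = standard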

Nkululeko: how to explicitly model linguistics

Since version 0.96, nkululeko provides linguistic feature extractors, i.e. features that use the text of the spoken words as input.

Of course you can combine them with acoustic features and use any fitting model architecture with it.

[EXP]
# optional: language for linguistics
language = de

[DATA]
data = ../mydata
# the linguistic feature extractors require a column named "text"
# example, perhaps not needed!
data.colnames = {"transcription":"text"}

[FEATS]
# combine linguistic BERT features with acoustic openSMILE features
type = ['bert', 'os']

[MODEL]
type = xgb
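The experiment is then started as usual (the config file name is again just a placeholder):

python -m nkululeko.nkululeko --config exp_ling.ini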

Nkululeko: how to use train/dev/test splits

Supervised machine learning operates as follows: during the training phase, a learning algorithm is adapted to a training dataset, producing a trained model, which is then used to make predictions on a test set during the inference phase.

One potential issue with this approach is that, for sufficiently complex models, they may simply memorise all items in the training set rather than learning a generalised distinction based on an underlying process, such as emotional expression or speaker age. This means that while the model performs well on the training data, it fails to generalise to new data—a phenomenon known as overfitting.

To mitigate this, the model's hyperparameters are optimised using a held-out evaluation set that is not used during training. One particularly important hyperparameter is the number of epochs—that is, the number of times the entire training set is processed. Typically, to prevent overfitting, training is halted when performance on the evaluation set begins to decline, a technique known as early stopping. The model that performs best on the evaluation data is then selected.

However, this approach introduces a new problem: the model may now have overfitted to the evaluation data (and most likely has). This is why a third dataset, one that has not been used at any stage of model development, is necessary for final testing.

The evaluation set is often referred to as the dev set (short for development set). Consequently, Nkululeko now provides support for three distinct data splits: train, dev, and test.

Here is an example of how you would do this with emoDB (the distribution has no predefined splits for train, dev and test):

[EXP]
root = ./experiments/emodb_3split/
name = results
epochs = 100
traindevtest = True
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ["neutral", "sadness", "happiness"]
target = emotion
[FEATS]
type = ['os']
[MODEL]
type = mlp
layers = {'l1':100, 'l2':16}
patience = 10
[PLOT]
best_model = True
epoch_progression = True

You trigger the handling of three splits with

traindevtest = True

and the rest happens automatically in this case. The results are then shown for:

best model based on development sets:

best model for dev set, but evaluated on test set:

and the last model, evaluated on the dev set:

In this case, you see that the 62nd epoch performed like the 52nd for the dev set. But this best model, evaluated on the test set, drops by more than 20 % average recall, which is a more stable estimate of the general performance of this model (this is only a toy example with 4 speakers in the training set, and 2 each for dev and test).

Nkululeko: predict speaker id

Since version 0.93.0, nkululeko interfaces the pyannote segmentation package (as an alternative to silero).

There are two modules that you can use for this:

  • SEGMENT
  • PREDICT

The (huge) difference is that the SEGMENT module looks at each file in the input data and finds speakers per file (which can be a single large file), while the PREDICT module concatenates all input data and looks for different speakers across the whole database.

In any case, it is best run on a GPU, as CPU will be very slow (and there is no progress bar).

Segment module

If you specify the method in the [SEGMENT] section and the hf_token (needed for the pyannote model) in the [MODEL] section

[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all
[MODEL]
hf_token = <my hugging face token>

your resulting segmentations will have a predicted speaker id attached. Be aware that this is really slow on CPU, so best run it on a GPU and declare so in the [MODEL] section:

[MODEL]
hf_token = <my hugging face token>
device=gpu # or cuda:0

As a result a new plot would appear in the image folder: the distribution of speakers that were found, e.g. like this:

Predict module

Simply select speaker as the prediction target:

[PREDICT]
targets = ["speaker"]
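As with the other modules, the prediction is started with (the config name is again only an example):

python -m nkululeko.predict --config exp.ini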

Generally, the PREDICT module is described here