
Nkululeko: how to only extract features

Since version 1.1.3, nkululeko offers the possibility to run the demo module on pre-loaded models, i.e. feature extractors.
You can simply run the feature_demo module even without any initialization file. For example, if you want to test the output of the agender model, you could call Python like this (getting input from the local microphone):

 python -m nkululeko.feature_demo --model agender --mic

and get the output:

No config file specified, using default settings

================================================================================
MICROPHONE RECORDING
================================================================================
Recording NOW for 5 seconds - speak clearly!
Recording finished!
Audio saved to temporary file: /tmp/tmpht4rf3mr.wav

Processing 1 file(s)...
DEBUG: feature_demo: running feature extraction demo, nkululeko version 1.1.3
DEBUG: feature_demo: using model: agender

Initializing agender model...
(This may take a while on first run - downloading models...)
  Loading agender_agender model (this may download the model on first use)...
DEBUG: featureset: value for n_jobs is not found, using default: 8
DEBUG: featureset: value for agender.model is not found, using default: ./audmodel_agender/
DEBUG: featureset: initialized agender model
Model initialized successfully!
[1/1] Extracting features from: /tmp/tmpht4rf3mr.wav
Feature dimension: 4

Features saved to features_output.csv

================================================================================
FEATURE EXTRACTION SUMMARY
================================================================================
Total files processed: 1
Feature dimension: 4
Output shape: (1, 4)

First few rows:
                        feat_0    feat_1    feat_2    feat_3
/tmp/tmpht4rf3mr.wav  0.495758 -0.765581  6.437115 -4.332793
Cleaned up temporary recording file

DONE

In this case the output signifies the logits for age, female, male and child.
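To interpret such a row, the gender logits can be converted to probabilities with a softmax. This is only an illustrative sketch: it assumes the first value is age scaled to [0, 1] and the remaining three are gender logits, which matches the audEERING agender model this extractor is presumably based on.

```python
import math

# example output row from the run above: age, female, male, child
feats = [0.495758, -0.765581, 6.437115, -4.332793]

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# assumption: the age value is scaled to [0, 1], so multiply by 100 for years
age = feats[0] * 100
gender_probs = softmax(feats[1:])

print(f"estimated age: {age:.0f} years")
for label, p in zip(["female", "male", "child"], gender_probs):
    print(f"{label}: {p:.3f}")
```

For the row above, the softmax puts nearly all probability mass on the "male" logit, which dominates the other two.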

If you want to see all possible models, simply call

 python -m nkululeko.feature_demo --help

Nkululeko tutorial: voice of wellness workshop

Context

In September 2025, we held the Voice of Wellness workshop.

In this post, I walk through the nkululeko experiments I use for the tutorials there.

Prepare the Database

I use the Androids corpus (paper here).

The first thing you should probably do is check the data formats and re-sample if necessary.

[RESAMPLE]
# which of the data splits to re-sample: train, test or all (both)
sample_selection = all
replace = True
target = data_resampled.csv

Explore

Check the database distributions

python -m nkululeko.explore --config data/androids/exp.ini

Transcribe and translate

Transcribe (note: this should be done on a GPU).

Translate: no GPU required, as it uses a Google service.

Segment

Androids database samples are quite long sometimes.
It makes sense to check if approaches work better on shorter speech segments.

python -m nkululeko.segment --config data/androids/exp.ini

Filter the data

[DATA]
data.limit_samples_per_speaker = 8
data.filter = [['task', 'interview']]
check_size = 1000
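The two filter options above can be mimicked in plain Python (the sample table below is made up): data.filter keeps only rows matching a column value, and limit_samples_per_speaker caps the number of samples per speaker.

```python
from collections import defaultdict

# toy sample table (made-up data): (file, speaker, task)
samples = [
    ("a1.wav", "spk1", "interview"),
    ("a2.wav", "spk1", "reading"),
    ("a3.wav", "spk2", "interview"),
    ("a4.wav", "spk2", "interview"),
]

# data.filter = [['task', 'interview']]: keep only matching rows
filtered = [s for s in samples if s[2] == "interview"]

def limit_per_speaker(rows, limit):
    """Keep at most `limit` samples per speaker, in order of appearance."""
    counts = defaultdict(int)
    kept = []
    for row in rows:
        if counts[row[1]] < limit:
            counts[row[1]] += 1
            kept.append(row)
    return kept

# data.limit_samples_per_speaker (here 1 instead of 8, to show the effect)
limited = limit_per_speaker(filtered, 1)
print(limited)
```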

Define splits

Either use pre-defined folds:

[MODEL]
logo=5

or, randomly define splits, but stratify them:

[DATA]
data.split_strategy = balanced
data.balance = {'depression':2, 'age':1, 'gender':1}
data.age_bins = 2
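The age_bins option can be pictured as discretising the continuous age values into a categorical variable that can then be stratified. A minimal sketch (the median split, the bin labels and the toy ages are my assumptions, not necessarily what nkululeko does internally):

```python
# toy speaker ages (made-up values)
ages = [23, 31, 45, 52, 60, 28]

# data.age_bins = 2: discretise age into two groups, here at the median
ordered = sorted(ages)
median = ordered[len(ordered) // 2]  # upper median for even-length lists
age_binned = ["young" if a < median else "old" for a in ages]

print(list(zip(ages, age_binned)))
```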

Add additional training data

More details here

[DATA]
databases = ['data', 'emodb']
data.split_strategy = speaker_split
# add German emotional data
emodb = ./data/emodb/emodb
# rename emotion to depression
emodb.colnames = {"emotion": "depression"}
# only use neutral and sad samples
emodb.filter = [["depression", ["neutral", "sadness"]]]
# map them to depression
emodb.mapping = {"neutral": "control", "sadness": "depressed"}
# and put everything to the training
emodb.split_strategy = train
target = depression
labels = ['depressed', 'control']
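The filter and mapping steps from the config above can be sketched in plain Python (toy rows with made-up file names):

```python
# toy emodb-style rows (made-up file names): (file, emotion)
emodb_samples = [
    ("f01.wav", "neutral"),
    ("f02.wav", "sadness"),
    ("f03.wav", "anger"),
]

# emodb.filter: only use neutral and sad samples
keep = ["neutral", "sadness"]
filtered = [(f, lab) for f, lab in emodb_samples if lab in keep]

# emodb.mapping: map the emotions onto the depression classes
mapping = {"neutral": "control", "sadness": "depressed"}
mapped = [(f, mapping[lab]) for f, lab in filtered]

print(mapped)
```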

Nkululeko: using uncertainty

Since version 0.94, nkululeko explicitly visualizes (aleatoric) uncertainty, i.e. the confidence of the model. You simply find a plot in the image folder after running an experiment, like so:

You see the distribution for true vs. false predictions with respect to uncertainty; in this case it worked out quite well (because less uncertain predictions are usually correct).

The approach is described in our paper Uncertainty-Based Ensemble Learning For Speech Classification

You can use this to tweak your results by specifying an uncertainty threshold, i.e. you refuse to predict samples whose uncertainty is above that threshold:

[PLOT]
uncertainty_threshold = .4

You will then additionally get a confusion plot that only takes the selected samples into account.

This might feel like cheating, but especially in critical use cases it might be better to deliver no prediction than a wrong one.
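The effect of such a threshold can be sketched in plain Python (the predictions and uncertainty values below are made up):

```python
# toy (prediction, truth, uncertainty) triples -- made-up numbers
preds = [
    ("yes", "yes", 0.10),
    ("no",  "yes", 0.80),
    ("no",  "no",  0.25),
    ("yes", "no",  0.55),
]

threshold = 0.4  # as in the config above

# refuse to predict samples whose uncertainty is above the threshold
kept = [(p, t) for p, t, u in preds if u <= threshold]
coverage = len(kept) / len(preds)
accuracy = sum(p == t for p, t in kept) / len(kept)

print(f"coverage: {coverage:.2f}, accuracy on kept samples: {accuracy:.2f}")
```

The trade-off is visible: accuracy on the retained samples rises, but only half of the samples receive a prediction at all.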

Nkululeko: feature scaling

As described in this previous post, feature scaling can be quite important in machine learning.

With nkululeko since version 0.97, you have a multitude of scaling methods at hand.

You simply state in the config:

[FEATS]
scale = xxx

For xxx, the available scaling methods are:

  • standard: z-transformation (mean of 0 and std of 1) based on the training set
  • robust: robust scaler
  • speaker: like standard but based on individual speaker sets (also for the test)
  • bins: convert feature values into 0, .5 and 1 (for low, mid and high)
  • minmax: rescales the data set such that all feature values are in the range [0, 1]
  • maxabs: similar to MinMaxScaler except that the values are mapped across several ranges depending on whether negative or positive values are present
  • normalizer: scales each sample (row) individually to have unit norm (e.g., L2 norm)
  • powertransformer: applies a power transformation to each feature to make the data more Gaussian-like in order to stabilize variance and minimize skewness
  • quantiletransformer: applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution (range [0, 1])
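A few of these methods can be sketched in plain Python to show what happens to the values (the toy feature column is made up, and the tertile boundaries for bins are my assumption, not necessarily nkululeko's exact cut points):

```python
# toy 1-D feature column of a training set (made-up values)
train = [2.0, 4.0, 6.0, 8.0]

# standard: z-transformation based on the training set
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5
standard = [(x - mean) / std for x in train]

# minmax: rescale to the range [0, 1]
lo, hi = min(train), max(train)
minmax = [(x - lo) / (hi - lo) for x in train]

# bins: quantise into 0, .5 and 1 for low, mid and high
def to_bins(values):
    ordered = sorted(values)
    t1 = ordered[len(ordered) // 3]       # lower tertile boundary
    t2 = ordered[2 * len(ordered) // 3]   # upper tertile boundary
    return [0.0 if v < t1 else 0.5 if v < t2 else 1.0 for v in values]

bins = to_bins(train)
print(standard, minmax, bins, sep="\n")
```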

Nkululeko: how to explicitly model linguistics

With nkululeko since version 0.96, there are linguistic feature extractors, i.e. using the text of the spoken words as input.

Of course you can combine them with acoustic features and use any fitting model architecture with it.

[EXP]
# optional: language for linguistics
language = de

[DATA]
data = ../mydata
# the linguistic feature extractors require a column named "text"
# example, perhaps not needed!
data.col_names = {"transcription":"text"}

[FEATS]
# combine linguistic bert features with acoustic open smile features
type = ['bert', 'os']

[MODEL]
type = xgb

Nkululeko: how to use train/dev/test splits

Supervised machine learning operates as follows: during the training phase, a learning algorithm is adapted to a training dataset, producing a trained model, which is then used to make predictions on a test set during the inference phase.

One potential issue with this approach is that, for sufficiently complex models, they may simply memorise all items in the training set rather than learning a generalised distinction based on an underlying process, such as emotional expression or speaker age. This means that while the model performs well on the training data, it fails to generalise to new data—a phenomenon known as overfitting.

To mitigate this, the model's hyperparameters are optimised using a held-out evaluation set that is not used during training. One particularly important hyperparameter is the number of epochs—that is, the number of times the entire training set is processed. Typically, to prevent overfitting, training is halted when performance on the evaluation set begins to decline, a technique known as early stopping. The model that performs best on the evaluation data is then selected.
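Early stopping with a patience parameter can be sketched as follows (toy dev-set scores; the logic mirrors the common pattern, not necessarily nkululeko's exact implementation):

```python
# toy accuracy on the dev set per epoch (made-up numbers, higher is better)
dev_scores = [0.50, 0.55, 0.60, 0.59, 0.58, 0.57, 0.56]
patience = 3  # stop after this many epochs without improvement

best_score, best_epoch, wait = float("-inf"), -1, 0
for epoch, score in enumerate(dev_scores):
    if score > best_score:
        best_score, best_epoch, wait = score, epoch, 0  # new best model
    else:
        wait += 1
        if wait >= patience:
            break  # early stopping

print(f"stopped after epoch {epoch}, best model from epoch {best_epoch}")
```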

However, this approach introduces a new problem: the model may (and most likely has) now overfitted to the evaluation data. This is why a third dataset is necessary for final testing—one that has not been used at any stage of model development.

The evaluation set is often referred to as the dev set (short for development set). Consequently, Nkululeko now provides support for three distinct data splits: train, dev, and test.

Here is an example of how you would do this with emoDB (the distribution has no predefined splits for train, dev and test):

[EXP]
root = ./experiments/emodb_3split/
name = results
epochs = 100
traindevtest = True
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ["neutral", "sadness", "happiness"]
target = emotion
[FEATS]
type = ['os']
[MODEL]
type = mlp
layers = {'l1':100, 'l2':16}
patience = 10
[PLOT]
best_model = True
epoch_progression = True

You trigger the handling of three splits with

traindevtest = True

and the rest happens automatically in this case. The results are then shown for:

the best model based on the development set:

best model for dev set, but evaluated on test set:

and the last model, evaluated on the dev set:

In this case, you see that the 62nd epoch performed like the 52nd on the dev set. But this best model, evaluated on the test set, drops by more than 20% average recall, which is a more stable value for the general performance of this model (this is only a toy example with 4 speakers in the training set and 2 each in the dev and test sets).

Nkululeko: predict speaker id

With nkululeko since version 0.93.0, the pyannote segmentation package is interfaced (as an alternative to silero).

There are two modules that you can use for this:

  • SEGMENT
  • PREDICT

The (huge) difference is that the SEGMENT module looks at each file in the input data and looks for speakers per file (which can be just one large file), while the PREDICT module concatenates all input data and looks for different speakers in the whole database.

In any case, best run it on a GPU, as on CPU it will be very slow (and there is no progress bar).

Segment module

If you specify the method in the [SEGMENT] section and the hf_token (needed for the pyannote model) in the [MODEL] section,

[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all
[MODEL]
hf_token = <my hugging face token>

your resulting segmentations will have predicted speaker ids attached. Be aware that this is really slow on CPU, so best run on GPU and declare so in the [MODEL] section:

[MODEL]
hf_token = <my hugging face token>
device=gpu # or cuda:0

As a result, a new plot will appear in the image folder: the distribution of speakers that were found, e.g. like this:

Predict module

Simply select speaker as the prediction target:

[PREDICT]
targets = ["speaker"]

Generally, the PREDICT module is described here

Nkululeko: how to finetune a transformer model

With nkululeko since version 0.85.0 you can finetune a transformer model with huggingface (and even publish it there if you like).

If you would like to have your model published, set:

[MODEL]
push_to_hub = True

Finetuning in this context means to train the (pre-trained) transformer layers with your new training data labels, as opposed to only using the last layer as embeddings.

The only thing you need to do is to set your MODEL type to finetune:

[FEATS]
type = []
[MODEL]
type = finetune

The acoustic features can/should be empty, because the transformer model starts with CNN layers to model the acoustics frame-wise. The frames are then pooled by the model for the whole utterance (max. duration: the first 8 seconds, the rest is ignored).
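The truncation and pooling can be pictured like this (mean pooling and the toy numbers are used for illustration only; the actual pooling depends on the model):

```python
sample_rate = 16000
max_duration = 8  # seconds that are used, the rest is disregarded

# 10 s of toy audio -> only the first 8 s are kept
signal = [0.0] * (10 * sample_rate)
truncated = signal[: max_duration * sample_rate]

# toy frame-wise embeddings (one vector per frame, made-up numbers)
frames = [
    [0.1, 0.2, 0.3],
    [0.3, 0.2, 0.1],
    [0.2, 0.2, 0.2],
]

# pool the frame vectors into one utterance-level vector (mean pooling here)
dim = len(frames[0])
pooled = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

print(len(truncated), pooled)
```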

The default base model is the one from Facebook, but you can specify a different one like this:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
max_duration = 10.5

The parameter max_duration is also optional (default = 8) and specifies the maximum duration of your samples / segments (in seconds) that will be used, starting from 0. The rest is disregarded.

You can use the usual deep learning parameters:

[MODEL]
learning_rate = .001
batch_size = 16
device = cuda:3
measure = mse

but all of them have defaults.

The loss function is fixed to

  • weighted cross entropy for classification
  • concordance correlation coefficient for regression
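The concordance correlation coefficient (CCC) used for regression can be sketched in plain Python:

```python
def ccc(x, y):
    """Concordance correlation coefficient between two number sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

print(ccc([1, 2, 3, 4], [1, 2, 3, 4]))  # → 1.0 (perfect concordance)
```

Unlike Pearson correlation, CCC also penalizes differences in mean and variance between prediction and target, which is why a constantly shifted prediction scores below 1.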

The resulting best model and the huggingface logs (which can be read by tensorboard) are stored in the project folder.