All posts by felix

Nkululeko: how to compare classifiers, features and databases using multiple runs

Since version 0.98, nkululeko provides functionality to compare the outcomes of several runs across experiments.

Say you would like to know whether the difference between using acoustic (opensmile) features and linguistic embeddings (bert) as input for some classifier is significant. You could then use the outcomes of several runs of one MLP (multi-layer perceptron) as tests that represent all possible runs (disclaimer: as far as I know, this approach is disputed by some statisticians).

You would set up your experiment like this:

[EXP]
...
runs = 10
epochs = 100
[FEATS]
type = ['bert']
#type = ['os']
#type = ['os', 'bert']
[MODEL]
type = mlp
...
patience = 5
[EXPL]
# turn on extensive statistical output
print_stats = True
[PLOT]
runs_compare = features

and run this three times, each time changing the feature type that is used (bert, os, or the combination of both), so that in the end your results folder contains three different run_results text files.
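For reference, each of the three experiments is started as usual with the nkululeko module (the config file name here is just a placeholder):

python -m nkululeko.nkululeko --config exp_runs.ini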

Using this, nkululeko prints a plot that compares the three feature sets; here's an example (using only 5 runs):

The title states the overall significance across all differences, as well as the largest pair-wise one. If your number of runs is larger than 30, t-tests will be used instead of Mann-Whitney tests.

Nkululeko tutorial: voice of wellness workshop

Context

In September 2025, we held the Voice of Wellness workshop.

In this post I go through the nkululeko experiments I used for the tutorials there.

Prepare the Database

I use the Androids corpus (paper here).

First thing you should probably do is check the data formats and re-sample if necessary.

[RESAMPLE]
# which of the data splits to re-sample: train, test or all (both)
sample_selection = all
replace = True
target = data_resampled.csv
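The re-sampling itself is done with the resample module; assuming the config file used throughout this post is data/androids/exp.ini:

python -m nkululeko.resample --config data/androids/exp.ini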

Explore

Check the database distributions

python -m nkululeko.explore --config data/androids/exp.ini

Transcribe and translate

Transcribe the samples. Note: this should be done on a GPU.

Translate the transcriptions; no GPU required, as this uses a Google service.
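A hedged sketch of how this could look in the config; the target names below are my assumption (check the nkululeko documentation for the exact names the PREDICT module expects):

[PREDICT]
# "text" for the ASR transcription, "translation" for the Google-based translation (names assumed)
targets = ['text', 'translation']

python -m nkululeko.predict --config data/androids/exp.ini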

Segment

Androids database samples are quite long sometimes.
It makes sense to check if approaches work better on shorter speech segments.

python -m nkululeko.segment --config data/androids/exp.ini
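A minimal sketch of a matching [SEGMENT] section, re-using the keys shown in the pyannote post further down; the silero method value is an assumption based on the package name:

[SEGMENT]
# VAD-based segmentation of long recordings into shorter speech chunks
method = silero
segment_target = _segmented
sample_selection = all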

Filter the data

[DATA]
# use at most 8 samples per speaker
data.limit_samples_per_speaker = 8
# restrict the data to samples from the interview task
data.filter = [['task', 'interview']]
# check for a minimal sample size
check_size = 1000

Define splits

Either use pre-defined folds:

[MODEL]
# leave-one-group-out cross validation with 5 speaker-group folds
logo = 5

or, randomly define splits, but stratify them:

[DATA]
data.split_strategy = balanced
data.balance = {'depression':2, 'age':1, 'gender':1}
data.age_bins = 2

Add additional training data

More details here

[DATA]
databases = ['data', 'emodb']
data.split_strategy = speaker_split
# add German emotional data
emodb = ./data/emodb/emodb
# rename emotion to depression
emodb.colnames = {"emotion": "depression"}
# only use neutral and sad samples
emodb.filter = [["depression", ["neutral", "sadness"]]]
# map them to depression
emodb.mapping = {"neutral": "control", "sadness": "depressed"}
# and put everything to the training
emodb.split_strategy = train
target = depression
labels = ['depressed', 'control']

Nkululeko: how to align databases

Sometimes you might want to combine databases that are similar, but don't label exactly the same phenomenon.

Take, for example, stress and emotion: you don't have enough data labelled for stress, but many emotion databases that label anger and happiness. You might then try to use angry samples as stressed and happy or neutral ones as non-stressed.

Taking the usual emodb as an example, and the famous SUSAS as a database sampling stressed voices, you can do this as follows:

[DATA]
databases = ['emodb', 'susas']

emodb = ./data/emodb/emodb
# indicate where the target values are
emodb.target_tables = ["emotion"]
# rename emotion to stress
emodb.colnames = {"emotion": "stress"}
# only use angry, neutral and happy samples
emodb.filter = [["stress", ["anger", "neutral", "happiness"]]]
# map them to stress
emodb.mapping = {"anger": "stress",  "neutral": "no stress", "happiness": "no stress"}
# and put everything to the training
emodb.split_strategy = train

susas = data/susas/
# map ternary stress labels to binary
susas.mapping = {'0,1':'no stress', '2':'stress'}
susas.split_strategy = speaker_split

target = stress
labels = ["stress", "no stress"]

So SUSAS will be split into train and test, but the training set will be strengthened by the whole of emodb. This usually makes more sense if a third database is available for evaluation, because in-domain machine learning works better in most cases than adding out-of-domain data (as we do here with emodb).
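As a hedged sketch, such a third, out-of-domain database could be reserved entirely for evaluation; the name and path below are placeholders, and split_strategy = test is assumed to be the counterpart of the train strategy used above:

[DATA]
databases = ['emodb', 'susas', 'thirddb']
thirddb = ./data/thirddb/
thirddb.split_strategy = test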

Nkululeko: using uncertainty

Since version 0.94, nkululeko explicitly visualizes (aleatoric) uncertainty, i.e. how confident the model is in its predictions. You simply find a plot in the image folder after running an experiment, like so:

You see the distribution of uncertainty for true vs. false predictions; in this case it worked out quite well (because less uncertain predictions are usually correct).

The approach is described in our paper Uncertainty-Based Ensemble Learning For Speech Classification

You can use this to tweak your results by specifying an uncertainty threshold, i.e. you refuse to predict samples whose uncertainty is above that threshold:

[PLOT]
uncertainty_threshold = .4

You will then additionally get a confusion plot that only takes the selected samples into account.

This might feel like cheating, but especially in critical use cases it might be better to deliver no prediction at all than a wrong one.

Nkululeko: feature scaling

As described in this previous post, feature scaling can be quite important in machine learning.

Since version 0.97, nkululeko offers a multitude of scaling methods.

You simply state in the config:

[FEATS]
scale = xxx

For xxx, you can specify one of the following scaling methods:

  • standard: z-transformation (mean of 0 and std of 1) based on the training set
  • robust: robust scaler (centres on the median and scales by the interquartile range, so it is less sensitive to outliers)
  • speaker: like standard, but based on each individual speaker's samples (also for the test set)
  • bins: convert feature values into 0, .5 and 1 (for low, mid and high)
  • minmax: rescales the data set such that all feature values are in the range [0, 1]
  • maxabs: similar to minmax, but scales each feature by its maximum absolute value, so values end up in the range [-1, 1] (useful when the data contains negative values)
  • normalizer: scales each sample (row) individually to have unit norm (e.g., L2 norm)
  • powertransformer: applies a power transformation to each feature to make the data more Gaussian-like, in order to stabilize variance and minimize skewness
  • quantiletransformer: applies a non-linear transformation such that the probability density function of each feature is mapped to a uniform distribution (range [0, 1]) or a Gaussian distribution
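For example, to z-normalise openSMILE features based on the training set (the feature type here is only an example):

[FEATS]
type = ['os']
scale = standard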

Nkululeko: how to explicitly model linguistics

Since version 0.96, nkululeko provides linguistic feature extractors, i.e. features that use the text of the spoken words as input.

Of course you can combine them with acoustic features and use any fitting model architecture with it.

[EXP]
# optional: language for linguistics
language = de

[DATA]
data = ../mydata
# the linguistic feature extractors require a column named "text"
# example, perhaps not needed!
data.colnames = {"transcription":"text"}

[FEATS]
# combine linguistic BERT features with acoustic openSMILE features
type = ['bert', 'os']

[MODEL]
type = xgb
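The experiment is then started as usual (the config file name is again just a placeholder):

python -m nkululeko.nkululeko --config exp_ling.ini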

Nkululeko: how to use train/dev/test splits

Supervised machine learning operates as follows: during the training phase, a learning algorithm is adapted to a training dataset, producing a trained model, which is then used to make predictions on a test set during the inference phase.

One potential issue with this approach is that, for sufficiently complex models, they may simply memorise all items in the training set rather than learning a generalised distinction based on an underlying process, such as emotional expression or speaker age. This means that while the model performs well on the training data, it fails to generalise to new data—a phenomenon known as overfitting.

To mitigate this, the model's hyperparameters are optimised using a held-out evaluation set that is not used during training. One particularly important hyperparameter is the number of epochs—that is, the number of times the entire training set is processed. Typically, to prevent overfitting, training is halted when performance on the evaluation set begins to decline, a technique known as early stopping. The model that performs best on the evaluation data is then selected.

However, this approach introduces a new problem: the model may now have overfitted to the evaluation data (and most likely has). This is why a third dataset, one that has not been used at any stage of model development, is necessary for final testing.

The evaluation set is often referred to as the dev set (short for development set). Consequently, Nkululeko now provides support for three distinct data splits: train, dev, and test.

Here is an example of how you would do this with emoDB (the distribution has no predefined splits for train, dev and test):

[EXP]
root = ./experiments/emodb_3split/
name = results
epochs = 100
traindevtest = True
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ["neutral", "sadness", "happiness"]
target = emotion
[FEATS]
type = ['os']
[MODEL]
type = mlp
layers = {'l1':100, 'l2':16}
patience = 10
[PLOT]
best_model = True
epoch_progression = True

You trigger the handling of three splits with

traindevtest = True

and the rest happens automatically in this case. The results are then shown for:

best model based on development sets:

best model for dev set, but evaluated on test set:

and the last model, evaluated on the dev set:

In this case, you see that the 62nd epoch performed like the 52nd for the dev set. But this best model, evaluated on the test set, drops by more than 20 % average recall, which is a more stable estimate of the general performance of this model (this is only a toy example with 4 speakers in the training set, and 2 each for dev and test).

Nkululeko: predict speaker id

Since version 0.93.0, nkululeko interfaces the pyannote segmentation package (as an alternative to silero).

There are two modules that you can use for this:

  • SEGMENT
  • PREDICT

The (huge) difference is that the SEGMENT module looks at each file in the input data and finds speakers per file (which can be a single large file), while the PREDICT module concatenates all input data and looks for different speakers across the whole database.

In any case, it is best run on a GPU, as CPU will be very slow (and there is no progress bar).

Segment module

If you specify the method in the [SEGMENT] section and the hf_token (needed for the pyannote model) in the [MODEL] section

[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all
[MODEL]
hf_token = <my hugging face token>

your resulting segmentations will have a predicted speaker id attached. Be aware that this is really slow on CPU, so best run it on a GPU and declare so in the [MODEL] section:

[MODEL]
hf_token = <my hugging face token>
device=gpu # or cuda:0

As a result a new plot would appear in the image folder: the distribution of speakers that were found, e.g. like this:

Predict module

Simply select speaker as the prediction target:

[PREDICT]
targets = ["speaker"]
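As with the other modules, the prediction is started with (the config name is again only an example):

python -m nkululeko.predict --config exp.ini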

Generally, the PREDICT module is described here