All posts by felix

Nkululeko: the new predict module

Since version 1.7, nkululeko has a new predict module, which covers any use of already trained models, such as demoing them, running them on some test data or participating in a challenge.

nkululeko.predict

nkululeko.predict is the unified prediction module of Nkululeko. It replaces
the previous nkululeko.demo, nkululeko.feature_demo and nkululeko.testing
modules and bundles all of their functionality behind a single command-line
interface.

You can use it to predict labels for:

  • one or more individual audio files (--file)
  • every audio file inside a folder (--folder)
  • the audio paths listed in a CSV (--list, original columns are preserved)
  • a live microphone recording (--mic)
  • the dataframe defined by an experiment config — pass --config without
    any of the input flags above and the module loads the databases declared
    in [DATA] (subset via EXP.sample_selection, default all)

…using one of two prediction sources:

  • a feature extractor or autopredict target such as age, gender,
    emotion, mos, snr (--type feats, the default)
  • the best model from a previously trained experiment
    (--type model, requires --config)

Command-line interface

python -m nkululeko.predict
    [--file AUDIO [AUDIO ...] | --folder FOLDER | --list CSV | --mic]
    [--model MODEL] [--type {feats,model}]
    [--config CONFIG.ini] [--outfile OUTFILE]
    [--language LANG] [--no_playback]
Argument | Description
--file AUDIO [AUDIO ...] | One or more audio files. A single space-separated string also works (e.g. --file "a.wav b.wav"). Writes a per-file <name>_result.txt next to each input and prints results to stdout.
--folder FOLDER | Folder to scan recursively for audio (wav, mp3, flac, ogg, m4a, au, aac). Writes a single CSV to --outfile.
--list CSV | CSV with audio paths. Existing columns and the audformat index are preserved; prediction columns are appended. Writes a single CSV to --outfile.
--mic | Record 5 seconds from the microphone in a loop and print predictions to stdout.
--model MODEL | Either an autopredict target name (age, gender, emotion, mos, snr, pesq, sdr, stoi, arousal, valence, dominance, speaker, text, textclassification, translation) or a feature-extractor name (wav2vec2-..., opensmile, audmodel, emotion2vec-..., praat, clap, spkrec, trill, agender, whisper-..., ast, hubert-..., wavlm-..., squim, mos, snr). With --type model, --model is ignored and the trained model from the experiment is used.
--type {feats,model} | feats (default): use --model as autopredict target or feature extractor. model: load the best model from the experiment defined by --config.
--config CONFIG.ini | Optional INI file. Required for --type model. With --type feats it may supply FEATS.type so that --model can be omitted. When passed alone (without --file/--folder/--list/--mic), the dataframe defined by the experiment's [DATA] section is used; EXP.sample_selection (default all) selects train / test / all.
--outfile OUTFILE | Output CSV path for --list and --folder. Default: ./prediction_result.csv.
--language LANG | ISO 639-1 code (en, de, pl, …) for the text and translation autopredict targets. For --model text it sets the Whisper source language (overrides EXP.language). For --model translation it sets the Google Translate target language (overrides PREDICT.target_language).
--no_playback | In --mic mode, suppress the playback of the recording before prediction.

The four input arguments (--file, --folder, --list, --mic) are mutually
exclusive.
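
To make the rules above concrete, here is a minimal argparse sketch that mirrors the synopsis. It is an illustration only and not nkululeko's actual parser; the helper names build_parser and parse are hypothetical, and only the flags and defaults listed in the table are taken from the text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented synopsis (illustrative only)."""
    parser = argparse.ArgumentParser(prog="nkululeko.predict")
    # At most one input flag may be given. None at all is also valid,
    # because --config alone selects the dataframe declared in [DATA].
    inputs = parser.add_mutually_exclusive_group(required=False)
    inputs.add_argument("--file", nargs="+", metavar="AUDIO")
    inputs.add_argument("--folder")
    inputs.add_argument("--list", metavar="CSV")
    inputs.add_argument("--mic", action="store_true")
    parser.add_argument("--model")
    parser.add_argument("--type", choices=["feats", "model"], default="feats")
    parser.add_argument("--config")
    parser.add_argument("--outfile", default="./prediction_result.csv")
    parser.add_argument("--language")
    parser.add_argument("--no_playback", action="store_true")
    return parser


def parse(argv):
    """Parse argv and enforce that --type model comes with --config."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.type == "model" and args.config is None:
        parser.error("--type model requires --config")
    return args
```

Combining two input flags, or asking for --type model without --config, makes this sketch exit with a usage error.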

Examples

Predict emotion for a couple of audio files

python -m nkululeko.predict --file test.mp3 test2.wav --model emotion

This writes test_result.txt and test2_result.txt next to each input and
also prints the predictions to stdout. With --model emotion, the
nkululeko.autopredict.ap_emotion predictor is used.
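
The naming of the per-file results can be sketched in a few lines of Python. This is a hedged illustration of the <name>_result.txt convention with one "key: value" pair per line; the functions result_path and format_result are hypothetical and not part of nkululeko.

```python
from pathlib import Path


def result_path(audio_file: str) -> Path:
    """Return the <name>_result.txt path located next to the input file."""
    audio = Path(audio_file)
    return audio.with_name(f"{audio.stem}_result.txt")


def format_result(predictions: dict) -> str:
    """Render predictions as one 'key: value' pair per line."""
    return "\n".join(f"{key}: {value}" for key, value in predictions.items())
```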

Predict SNR for every audio file in a folder

python -m nkululeko.predict --folder ./recordings --model snr --outfile snr.csv

The output CSV contains the audformat segmented index plus the new
snr_pred column.

Add prediction columns to an existing CSV, keeping all original columns

python -m nkululeko.predict \
    --list testdata.csv \
    --model mos \
    --outfile testdata_with_mos.csv

If testdata.csv is a valid audformat CSV (segmented or filewise index), the
index is preserved. Otherwise the first column is interpreted as the audio
path. Any further columns are passed through to the output.
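
The pass-through behaviour for a plain (non-audformat) CSV can be sketched like this. The sketch assumes the first column holds the audio path; the predict argument is a hypothetical stand-in for whatever predictor is selected, and the function is not nkululeko's implementation.

```python
import csv
import io


def append_predictions(csv_text, predict, column="mos_pred"):
    """Append one prediction column, treating the first column as the audio path."""
    reader = csv.DictReader(io.StringIO(csv_text))
    path_column = reader.fieldnames[0]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + [column], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # all original columns stay in the row, the prediction is added
        row[column] = predict(row[path_column])
        writer.writerow(row)
    return out.getvalue()
```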

Use the best model of a trained experiment

python -m nkululeko.predict \
    --list testdata.csv \
    --config config.ini \
    --type model

This loads the experiment specified in config.ini (which must have been
trained with MODEL.save = True) and runs its best model on each file in the
list. For classification, the output contains one column per class label with
the probability/score and a predicted column with the top-1 label. For
regression, a single predicted column is written.

Loop over microphone input using the FEATS section of a config

python -m nkululeko.predict --mic --config config.ini

Press Enter to record 5 seconds, q + Enter to quit.

Transcribe German audio with Whisper

python -m nkululeko.predict --file lecture.mp3 --model text --language de

--language de overrides EXP.language for the Whisper source language.

Predict over the dataframe defined by a config

When you only pass --config, the module loads the databases declared in the
config's [DATA] section and runs over the selection from
EXP.sample_selection (default all):

python -m nkululeko.predict \
    --config experiments/emodb/exp.ini \
    --model snr \
    --outfile emodb_snr.csv

Set EXP.sample_selection = train (or test) in the INI to restrict the
run to that subset.

Translate transcriptions to French

python -m nkululeko.predict \
    --list transcribed.csv \
    --model translation \
    --language fr \
    --outfile translated.csv

--language fr overrides PREDICT.target_language for Google Translate.

Autopredict targets

When --model NAME matches one of the autopredict targets below, the matching
nkululeko.autopredict.* predictor is used. The added column name follows the
<target>_pred convention.

Target | Predictor module | Added column
speaker | ap_sid.SIDPredictor | speaker_pred
gender | ap_gender.GenderPredictor (audEERING agender) | gender_pred
age | ap_age.AgePredictor (audEERING agender) | age_pred
emotion | ap_emotion.EmotionPredictor (emotion2vec) | emotion_pred
arousal | ap_arousal.ArousalPredictor (audEERING dim) | arousal_pred
valence | ap_valence.ValencePredictor (audEERING dim) | valence_pred
dominance | ap_dominance.DominancePredictor (audEERING dim) | dominance_pred
mos | ap_mos.MOSPredictor | mos_pred
pesq | ap_pesq.PESQPredictor (SQUIM) | pesq_pred
sdr | ap_sdr.SDRPredictor (SQUIM) | sdr_pred
stoi | ap_stoi.STOIPredictor (SQUIM) | stoi_pred
snr | ap_snr.SNRPredictor | snr_pred
text | ap_text.TextPredictor (whisper transcription) | text
textclassification | ap_textclassifier.TextClassificationPredictor | classification_winner + one column per candidate label
translation | ap_translate.TextTranslator | column named after PREDICT.target_language (default: en)

Feature extractors

If --model does not match an autopredict target, it is interpreted as a
feature-extractor name. The output columns are feat_0, feat_1, …. Examples:

python -m nkululeko.predict --file test.wav --model praat
python -m nkululeko.predict --folder ./voices --model wav2vec2-large-robust-ft-swbd-300h --outfile feats.csv
python -m nkululeko.predict --list audio.csv --model audmodel --outfile feats.csv --config has_audmodel_id.ini

Recognized prefixes / names: wav2vec2*, hubert*, wavlm*, whisper*,
ast*, emotion2vec*, opensmile/gemaps/compare, clap*, spkrec* /
xvect* / ecapa*, trill*, praat*, audmodel*, agender*, squim* /
pesq* / sdr*, mos*, snr*.

Note on overlapping names. mos and snr are both autopredict targets
and feature extractors. They resolve to the autopredict path. If you need
the raw feature extractor for these, use the lower-level extractor classes
directly.
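
The resolution order described here (exact autopredict target first, then feature-extractor prefix) could look roughly as follows. This is a sketch built only from the names listed in this post; resolve_model is a hypothetical helper and the real lookup inside nkululeko may work differently.

```python
AUTOPREDICT_TARGETS = {
    "age", "gender", "emotion", "mos", "snr", "pesq", "sdr", "stoi",
    "arousal", "valence", "dominance", "speaker", "text",
    "textclassification", "translation",
}
FEATURE_PREFIXES = (
    "wav2vec2", "hubert", "wavlm", "whisper", "ast", "emotion2vec",
    "opensmile", "gemaps", "compare", "clap", "spkrec", "xvect", "ecapa",
    "trill", "praat", "audmodel", "agender", "squim", "pesq", "sdr",
    "mos", "snr",
)


def resolve_model(name: str) -> str:
    """Return 'autopredict' or 'feats' for a --model value."""
    name = name.lower()
    # exact target names win, so overlapping names like mos and snr
    # end up on the autopredict path
    if name in AUTOPREDICT_TARGETS:
        return "autopredict"
    if name.startswith(FEATURE_PREFIXES):
        return "feats"
    raise ValueError(f"unknown model name: {name}")
```

Because the exact match is checked first, emotion resolves to the autopredict path while a name such as emotion2vec-large is treated as a feature extractor.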

Output formats

Mode | Where the result is written
--file | <name>_result.txt per input file (one key: value per line), plus stdout.
--folder | Single CSV at --outfile with the audformat segmented index of the discovered files and the prediction columns.
--list | Single CSV at --outfile with the original columns of the input CSV plus the prediction columns. The audformat index is preserved when the input is a valid audformat CSV.
--mic | stdout only.

Nkululeko: how to investigate correlations of specific features

As shown in this post, nkululeko can be used to investigate correlations of specific features with a target variable.

Now nkululeko can also be used to check the correlation between two real-valued acoustic features.

With the key regplot you can specify two features and optionally a target variable (if omitted, the ini-file target is used) like so:

[EXPL]
regplot = [['lld_mfcc3_sma3_median', 'lld_mfcc1_sma3_median'],
    ['lld_mfcc3_sma3_median', 'lld_F2frequency_sma3nz_median', 'age']]

The first tuple of features is related to the emotion target (default for this example data: emodb) and would produce this plot:

The second line states age as the target, which is a continuous target and thus will be grouped
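
To illustrate how such a regplot value is structured, here is a small sketch that reads it with Python's configparser and falls back to the ini-file target when an entry has only two items. It assumes the target is declared in [DATA], as in the other examples on this blog; read_regplots is a hypothetical helper and not nkululeko's code. Note that the continuation line of a multi-line value must be indented for configparser to accept it.

```python
import ast
import configparser

INI_TEXT = """
[DATA]
target = emotion
[EXPL]
regplot = [['lld_mfcc3_sma3_median', 'lld_mfcc1_sma3_median'],
    ['lld_mfcc3_sma3_median', 'lld_F2frequency_sma3nz_median', 'age']]
"""


def read_regplots(ini_text):
    """Return (feature_x, feature_y, target) triples from the regplot key."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    default_target = config["DATA"]["target"]
    plots = []
    for entry in ast.literal_eval(config["EXPL"]["regplot"]):
        feat_x, feat_y, *rest = entry
        # a third item overrides the default target
        target = rest[0] if rest else default_target
        plots.append((feat_x, feat_y, target))
    return plots
```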

Nkululeko: how to predict topics for your texts

Since version 1.0.1, nkululeko integrates a text classification model. It's a so-called zero-shot model, which means you can define the categories you would like to have predicted yourself.

Prerequisite for this is that your data is transcribed, i.e. there is a text column in your data.

Here is an example ini file showing how to use this on a transcribed version of emodb:

[EXP]
root = ./examples/results
name = emodb_textclassifier
[DATA]
databases = ['emodb']
emodb = ./examples/results//exp_emodb_translate/results/all_predicted.csv
emodb.type = csv
emodb.split_strategy = random
labels = ['anger', 'happiness']
target = emotion
[FEATS]
type = ['os']
store_format = csv
[MODEL]
type = svm
[PREDICT]
targets = ['textclassification']
textclassifier.candidates = ["sadness", "anger", "neutral", "happiness", "fear", "disgust", "boredom"]

The output comes in two versions: one with all columns and one with only the predicted emotions (from text):

file,start,end,classification_winner,sadness,anger,neutral,happiness,fear,disgust,boredom
./data/emodb/emodb/wav/12a01Fb.wav,0 days,0 days 00:00:01.863625,neutral,0.11576763540506363,0.1414959877729416,0.3593694567680359,0.05933323875069618,0.08951663225889206,0.12100014835596085,0.11351688951253891
./data/emodb/emodb/wav/12a01Wc.wav,0 days,0 days 00:00:02.358812500,neutral,0.12048673629760742,0.1446247100830078,0.25808465480804443,0.04279503598809242,0.0794658437371254,0.25803136825561523,0.09651164710521698

It makes sense that almost all predicted labels are neutral, because emodb was designed to have linguistically neutral emotional content.

Following the winner class are the scores for all candidate classes. In the rows above they sum to 1, so they are normalized probabilities rather than raw logits.
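
A quick way to read this output is to check that classification_winner is simply the candidate with the highest score. The sketch below does this for the two rows shown above, using only the standard library; check_rows is a hypothetical helper.

```python
import csv
import io

CANDIDATES = ["sadness", "anger", "neutral", "happiness", "fear", "disgust", "boredom"]

CSV_TEXT = """file,start,end,classification_winner,sadness,anger,neutral,happiness,fear,disgust,boredom
./data/emodb/emodb/wav/12a01Fb.wav,0 days,0 days 00:00:01.863625,neutral,0.11576763540506363,0.1414959877729416,0.3593694567680359,0.05933323875069618,0.08951663225889206,0.12100014835596085,0.11351688951253891
./data/emodb/emodb/wav/12a01Wc.wav,0 days,0 days 00:00:02.358812500,neutral,0.12048673629760742,0.1446247100830078,0.25808465480804443,0.04279503598809242,0.0794658437371254,0.25803136825561523,0.09651164710521698
"""


def check_rows(csv_text):
    """Return (winner, top-scoring label, score sum) for every row."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores = {label: float(row[label]) for label in CANDIDATES}
        top_label = max(scores, key=scores.get)
        results.append((row["classification_winner"], top_label, sum(scores.values())))
    return results
```

In the second row, neutral wins over disgust only by a tiny margin, which shows how close such zero-shot decisions can be.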

Nkululeko: how to compare classifiers, features and databases using multiple runs

Since version 0.98, nkululeko can compare the outcomes of several runs across experiments.

Say you would like to know whether the difference between using acoustic (opensmile) features and linguistic embeddings (bert) as features for some classifier is significant. You could then use the outcomes of several runs of one MLP (multi-layer perceptron) as tests that represent all possible runs (disclaimer: as far as I know, some statisticians dispute this approach).

You would set up your experiment like this:

[EXP]
...
runs = 10
epochs = 100
[FEATS]
type = ['bert']
#type = ['os']
#type = ['os', 'bert']
[MODEL]
type = mlp
...
patience = 5
[EXPL]
# turn on extensive statistical output
print_stats = True
[PLOT]
runs_compare = features

and run this three times, each time changing the feature type that is being used (bert, os, or the combination of both), so that in the end you have a results folder with three different run_results text files in it.

Using this, nkululeko produces a plot that compares the three feature sets. Here's an example (having used only 5 runs):

The title states the overall significance for all differences, as well as the largest one for pair-wise comparison. If your number of runs is larger than 30, t-tests will be used instead of Mann-Whitney tests.
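
The choice between the two tests, and what the Mann-Whitney statistic measures, can be sketched in plain Python. This is only an illustration under the rule stated above (more than 30 runs means t-test); it computes the U statistic but no p-value, and the function names are hypothetical rather than taken from nkululeko.

```python
def choose_test(n_runs: int) -> str:
    """Pick the test name according to the number of runs."""
    return "t-test" if n_runs > 30 else "Mann-Whitney"


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return the smaller U statistic of two samples of run results."""
    ranks = rank_with_ties(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)
```

A U value of 0 means that every run of one feature set scored below every run of the other, which is the strongest separation the test can see.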

Nkululeko tutorial: voice of wellness workshop

Context

In Sep 2025, we did the Voice of wellness workshop.

In this post I try out the nkululeko experiments I use for the tutorials there.

Prepare the Database

I use the Androids corpus (paper here).

First thing you should probably do is check the data formats and re-sample if necessary.

[RESAMPLE]
# which of the data splits to re-sample: train, test or all (both)
sample_selection = all
replace = True
target = data_resampled.csv

Explore

Check the database distributions

python -m nkululeko.explore --config data/androids/exp.ini

Transcribe and translate

Transcribe. Note: this should be done on a GPU.

Translate. No GPU is required, as it uses a Google service.

Segment

Androids database samples are quite long sometimes.
It makes sense to check if approaches work better on shorter speech segments.

python -m nkululeko.segment --config data/androids/exp.ini

Filter the data

[DATA]
data.limit_samples_per_speaker = 8
data.filter = [['task', 'interview']]
check_size = 1000

Define splits

Either use pre-defined folds:

[MODEL]
logo=5

or, randomly define splits, but stratify them:

[DATA]
data.split_strategy = balanced
data.balance = {'depression':2, 'age':1, 'gender':1}
data.age_bins = 2

Add additional training data

More details here

[DATA]
databases = ['data', 'emodb']
data.split_strategy = speaker_split
# add German emotional data
emodb = ./data/emodb/emodb
# rename emotion to depression
emodb.colnames = {"emotion": "depression"}
# only use neutral and sad samples
emodb.filter = [["depression", ["neutral", "sadness"]]]
# map them to depression
emodb.mapping = {"neutral": "control", "sadness": "depressed"}
# and put everything to the training
emodb.split_strategy = train
target = depression
labels = ['depressed', 'control']

Nkululeko: how to align databases

Sometimes you might want to combine databases that are similar, or alike, but don't handle exactly the same phenomena.

Take stress and emotion, for example: you might not have enough data labeled with stress, but many emotion databases that label anger and happiness. You might then try treating angry samples as stressed and happy or neutral ones as non-stressed.

Taking the usual emodb as an example, and the well-known Susas as a database sampling stressed voices, you can do it like this:

[DATA]
databases = ['emodb', 'susas']

emodb = ./data/emodb/emodb
# indicate where the target values are
emodb.target_tables = ["emotion"]
# rename emotion to stress
emodb.colnames = {"emotion": "stress"}
# only use angry, neutral and happy samples
emodb.filter = [["stress", ["anger", "neutral", "happiness"]]]
# map them to stress
emodb.mapping = {"anger": "stress",  "neutral": "no stress", "happiness": "no stress"}
# and put everything to the training
emodb.split_strategy = train

susas = data/susas/
# map ternary stress labels to binary
susas.mapping = {'0,1':'no stress', '2':'stress'}
susas.split_strategy = speaker_split

target = stress
labels = ["stress", "no stress"]

So Susas will be split into train and test, but the training will be strengthened by the whole of emodb. This usually makes more sense if a third database is available for evaluation, because in most cases in-domain machine learning works better than adding out-of-domain data (like we do here with emodb).
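
What the filter and mapping keys do to the labels can be sketched in plain Python. The sketch assumes that a comma-separated key such as '0,1' stands for each listed label individually; that reading, and the helper names expand_mapping and align_labels, are assumptions for illustration and not nkululeko's code.

```python
EMODB_KEEP = ["anger", "neutral", "happiness"]
EMODB_MAPPING = {"anger": "stress", "neutral": "no stress", "happiness": "no stress"}
SUSAS_MAPPING = {"0,1": "no stress", "2": "stress"}


def expand_mapping(mapping):
    """Split comma-separated keys such as '0,1' into one entry per label."""
    expanded = {}
    for keys, new_label in mapping.items():
        for key in keys.split(","):
            expanded[key.strip()] = new_label
    return expanded


def align_labels(labels, mapping, keep=None):
    """Drop labels outside `keep` (if given) and map the rest to the shared target."""
    lookup = expand_mapping(mapping)
    kept = [label for label in labels if keep is None or label in keep]
    # labels without a mapping entry are passed through unchanged
    return [lookup.get(label, label) for label in kept]
```

After this step both databases share the two target labels, stress and no stress.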

Nkululeko: using uncertainty

Since version 0.94, nkululeko explicitly visualizes (aleatoric) uncertainty, i.e. the confidence of the model. After running an experiment, you simply find a plot in the image folder, like so:

You see the distribution of true vs. false predictions with respect to uncertainty. In this case it worked out quite well, because less uncertain predictions are usually correct.

The approach is described in our paper Uncertainty-Based Ensemble Learning For Speech Classification

You can use this to tweak your results if you specify an uncertainty threshold, i.e. you refuse to predict samples that are above some threshold:

[PLOT]
uncertainty_threshold = .4

You will then additionally get a confusion plot that only takes the selected samples into account.

This might feel like cheating, but especially in critical use cases it might be better to deliver no prediction than a wrong one.
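
The idea of refusing uncertain predictions can be sketched as follows. The sketch assumes that uncertainty is measured as the normalized entropy of the class probabilities, which is one common choice; this post does not state the exact measure, so treat both the formula and the function names as assumptions.

```python
import math


def uncertainty(probabilities):
    """Normalized Shannon entropy: 0 for a one-hot output, 1 for a uniform one."""
    entropy = sum(-p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(len(probabilities))


def select_confident(predictions, threshold=0.4):
    """Keep only (label, probabilities) pairs whose uncertainty is not above the threshold."""
    return [
        (label, probs)
        for label, probs in predictions
        if uncertainty(probs) <= threshold
    ]
```

Only the samples that pass this filter would then enter the additional confusion plot.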

Nkululeko: feature scaling

As described in this previous post, feature scaling can be quite important in machine learning.

Since version 0.97, nkululeko has a multitude of scaling methods at hand.

You simply state in the config:

[FEATS]
scale = xxx

For xxx, the available scaling methods are:

  • standard: z-transformation (mean of 0 and std of 1) based on the training set
  • robust: robust scaler
  • speaker: like standard but based on individual speaker sets (also for the test)
  • bins: convert feature values into 0, .5 and 1 (for low, mid and high)
  • minmax: rescales the data set such that all feature values are in the range [0, 1]
  • maxabs: similar to MinMaxScaler except that the values are mapped across several ranges depending on whether negative OR positive values are present
  • normalizer: scales each sample (row) individually to have unit norm (e.g., L2 norm)
  • powertransformer: applies a power transformation to each feature to make the data more Gaussian-like in order to stabilize variance and minimize skewness
  • quantiletransformer: applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution (range [0, 1])
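
The difference between the standard and speaker methods can be shown with a small standard-library sketch for a single feature. It is an illustration of the two ideas only, with hypothetical function names, and not the code nkululeko runs.

```python
from statistics import mean, pstdev


def fit_standard(values):
    """Return mean and population standard deviation (1.0 if the std is zero)."""
    return mean(values), pstdev(values) or 1.0


def standard_scale(train_values, test_values):
    """z-transform train and test with statistics from the training set only."""
    mu, sigma = fit_standard(train_values)
    scaled_train = [(v - mu) / sigma for v in train_values]
    scaled_test = [(v - mu) / sigma for v in test_values]
    return scaled_train, scaled_test


def speaker_scale(values, speakers):
    """z-transform each value with the statistics of its own speaker."""
    by_speaker = {}
    for value, speaker in zip(values, speakers):
        by_speaker.setdefault(speaker, []).append(value)
    stats = {speaker: fit_standard(vals) for speaker, vals in by_speaker.items()}
    return [
        (value - stats[speaker][0]) / stats[speaker][1]
        for value, speaker in zip(values, speakers)
    ]
```

With speaker scaling, two speakers with very different value ranges end up on the same scale, which is why it is also applied to the test set.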

Nkululeko: how to explicitly model linguistics

Since version 0.96, nkululeko has linguistic feature extractors, i.e. extractors that use the text of the spoken words as input.

Of course you can combine them with acoustic features and use any fitting model architecture with it.

[EXP]
# optional: language for linguistics
language = de

[DATA]
data = ../mydata
# the linguistic feature extractors require a column named "text"
# example, perhaps not needed!
data.colnames = {"transcription":"text"}

[FEATS]
# combine linguistic bert features with acoustic open smile features
type = ['bert', 'os']

[MODEL]
type = xgb