How to use train, dev and test splits with Nkululeko

Usually in machine learning, you train your predictor on a train set, tune meta-parameters on a dev (development or validation) set and evaluate on a test set.
Nkululeko currently has no dedicated test set, as only two sets can be specified: a train and an evaluation set.
A work-around is to use the test module to evaluate your best model on a held-out test set at the end of your experiments.
All you need to do is to specify the name of the test data in your [DATA] section, like so (let's call it myconf.ini):

save = True
databases =  ['my_train-dev_data']
tests = ['my_test_data']
my_test_data = ./data/my_test_data/
my_test_data.split_strategy = test

You can run the experiment module with your config:

python -m nkululeko.nkululeko --config myconf.ini

and then, after optimization (of predictors, feature sets and meta-parameters), use the test module:

python -m nkululeko.test --config myconf.ini

The results will appear in the same place as all other results, but the files are named with test and the test database as a suffix.

If you need to compare several predictors and feature sets, you can use the nkuluflag module.
In the main script that calls the nkuluflag module, simply pass a parameter (named --mod) to tell it to use the test module:

cmd = 'python -m nkululeko.nkuluflag --config myconf.ini  --mod test '
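
Such a main script might look like the following minimal sketch. It only builds the command lines and does not run them. Only --config and --mod are taken from the text above; the --model and --feat flags, as well as the chosen predictor and feature names, are assumptions for illustration and should be checked against the nkuluflag help.

```python
import itertools

# Hypothetical candidates; replace with the predictors and feature sets to compare.
MODELS = ["xgb", "svm"]
FEATURE_SETS = ["os", "praat"]


def build_commands(config="myconf.ini", mod="test"):
    """Return one nkuluflag command line per (model, feature set) pair."""
    base = f"python -m nkululeko.nkuluflag --config {config} --mod {mod}"
    commands = []
    for model, feat in itertools.product(MODELS, FEATURE_SETS):
        # --model and --feat are assumed flag names, not confirmed by the text above.
        commands.append(f"{base} --model {model} --feat {feat}")
    return commands


if __name__ == "__main__":
    for cmd in build_commands():
        print(cmd)  # pass each line to os.system or subprocess.run to execute it
```

Building the commands separately from running them makes it easy to inspect the full list before starting a long batch of experiments.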

Nkululeko: how to tweak the target variable for database comparison

Sometimes you want to compare two different databases that share a similar target variable, say, related to likability, but on different scales: for example, one was rated on a scale from 1 to 10 and the other on a Likert scale from 1 to 7.

With nkululeko you can rename labels, normalize the target values, and even invert the polarity for each database.

In the following example there are two databases:

databases = ['db1', 'db2']
db1.split_strategy = test
db1.scale = True
db2.colnames = {'non-attractive':'likability'}
db2.split_strategy = train
db2.scale = True
db2.reverse = True
db2.reverse.max = 10
target = likability
bins = [-1000, .2, 1000]
labels = ['less likable', 'more likable']

The first database, db1, already has a likability label and just needs to be standard-normalized. The second one, db2, has a related label non-attractive, which needs to be renamed, inverted (based on a hypothetical maximum value of 10) and normalized.
Then, db1 can be used as test data and db2 as training.
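
To make the three steps concrete, here is a minimal sketch of what reversing, normalizing and binning do to the db2 values. It is not nkululeko's own code: that reversing means "maximum minus value" and that the population standard deviation is used are assumptions, and the ratings are made up.

```python
import bisect
import statistics


def reverse(values, maximum):
    """Invert the polarity: a high value becomes a low one."""
    return [maximum - v for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def to_label(value, bins, labels):
    """Map a continuous value to the label of the bin it falls into."""
    # bins holds the bin edges; the inner edges separate neighbouring labels
    index = bisect.bisect_right(bins[1:-1], value)
    return labels[index]


bins = [-1000, 0.2, 1000]
labels = ["less likable", "more likable"]

# db2: 'non-attractive' ratings on a 1-10 scale (made-up numbers)
non_attractive = [2, 9, 5, 7]
likability = standardize(reverse(non_attractive, maximum=10))
print([to_label(v, bins, labels) for v in likability])
```

After normalization both databases live on the same scale, which is why a single threshold such as .2 can be applied to both.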

Nkululeko: how to bin/discretize your feature values

With nkululeko since version 0.77.8 you can convert all feature values into the discrete classes low, mid and high.

Simply state

type = ['praat']
scale = bins
store_format = csv

in your config to use Praat features.
With the store format stated as csv you will be able to look at the train and test features in the store folder.

The binning is based on the 33rd and 66th percentiles of the training feature values.
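
The idea can be sketched in a few lines of Python. This is an illustration under assumptions, not nkululeko's implementation: the function names are hypothetical, the feature values are made up, and the exact percentile interpolation may differ.

```python
import statistics


def fit_thresholds(train_values):
    """Return the 33rd and 66th percentile of the training values."""
    cuts = statistics.quantiles(train_values, n=100)  # 99 cut points
    return cuts[32], cuts[65]


def discretize(value, low_cut, high_cut):
    """Map one feature value to 'low', 'mid' or 'high'."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "mid"
    return "high"


train_pitch = list(range(1, 101))  # made-up training values for one feature
low_cut, high_cut = fit_thresholds(train_pitch)

# The thresholds learned on the training data are reused for the test data.
test_pitch = [10, 50, 90]
print([discretize(v, low_cut, high_cut) for v in test_pitch])
```

Deriving the thresholds from the training data only keeps information about the test set from leaking into the features.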

Nkululeko: compare several databases

With nkululeko since version 0.77.7 there is a new interface named multidb which lets you compare several databases.

You can state their names in the [EXP] section, and they will then be processed one after another and against each other. The results are stored in a file called heatmap.png in the experiment folder.

Here is an example for such an ini file:

[EXP]
root = ./experiments/emodbs/
# DON'T give it a name,
# this will be the combination
# of the two databases:
# traindb_vs_testdb
epochs = 1
databases = ['emodb', 'polish']
[DATA]
root_folders = ./experiments/emodbs/data_roots.ini
target = emotion
labels = ['neutral', 'happy', 'sad', 'angry']
[FEATS]
type = ['os']
[MODEL]
type = xgb

You can (but don't have to) state the specific dataset values in an external file like above:

emodb = ./data/emodb/emodb
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
polish = ./data/polish_emo
polish.mapping = {'anger':'angry', 'joy':'happy', 'sadness':'sad', 'neutral':'neutral'}
polish.split_strategy = speaker_split
polish.test_size = 30

Call it with:

python -m nkululeko.multidb --config my_conf.ini
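
The "against each other" part can be pictured as every database being paired, as training data, with every database as test data. The following is a minimal sketch of that pairing, assuming the traindb_vs_testdb naming from the config comment; how the multidb module handles a database paired with itself is not shown here.

```python
import itertools

databases = ["emodb", "polish"]


def experiment_names(dbs):
    """Pair every database as training set with every database as test set."""
    return {
        (train, test): f"{train}_vs_{test}"
        for train, test in itertools.product(dbs, repeat=2)
    }


names = experiment_names(databases)
for (train, test), name in names.items():
    print(f"train on {train:7s} test on {test:7s} -> {name}")
```

With two databases this gives four combinations, one for each cell of a 2x2 heatmap.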

Here's a result with two databases:

and this is the same experiment, but with augmentations:

In order to add augmentation, simply add an [AUGMENT] section:

[EXP]
root = ./experiments/emodbs/augmented/
epochs = 1
databases = ['emodb', 'polish']
[AUGMENT]
augment = ['traditional', 'random_splice']

In order to add an additional training database to all experiments, you can use:

train_extra = [meta, emodb]

This adds two databases to all training data sets; meta and emodb should then be declared in the root_folders file.

Nkululeko: oversample the training set

Sometimes, with categorically labeled data, the number of samples per class is very unevenly distributed, misleading the model to think that the overwhelming majority class is more important than the others.
In this case, two techniques might help: class weighting assigns a higher weight to samples from minority classes, and oversampling "invents" new samples for the minority classes.
With nkululeko since version 0.70.0, you can oversample the training set with different algorithms implemented by the imbalanced-learn package.
You simply state the method in the FEATS section like so:

# either ros, smote or adasyn
balancing = adasyn

Three methods are available:

  • ros: simply repeat random samples from the minority classes
  • smote: "invent" new minority samples by little changes from the existing ones
  • adasyn: similar to smote, but resulting in uneven class distributions
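
To illustrate the simplest of the three, here is a minimal pure-Python sketch of random oversampling (ros). It is not the implementation nkululeko calls; the function name and the toy data are made up for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=42):
    """Repeat random minority samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


features = [[0.1], [0.2], [0.3], [0.4], [0.9]]
emotions = ["neutral", "neutral", "neutral", "neutral", "angry"]
new_features, new_emotions = random_oversample(features, emotions)
print(Counter(new_emotions))
```

Oversampling should only ever be applied to the training set, so that the test set keeps its original class distribution.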

Nkululeko: re-name data column names

With nkululeko since version 0.68.1, you can re-name data fields (columns in your data table) by setting the following in your ini-file:

databases = ['mydata']
mydata.colnames = {'Participant ID':'speaker', 'sex':'gender', 'Age': 'age'}

This means that, before further processing, the Participant ID field in your database mydata will be treated as the speaker label, and so on.
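
The effect of the mapping can be sketched on a single table row as follows. This is an illustration only: the helper function and the example row are hypothetical.

```python
colnames = {"Participant ID": "speaker", "sex": "gender", "Age": "age"}


def rename_columns(rows, mapping):
    """Return the rows with column names replaced according to mapping."""
    return [
        {mapping.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# one made-up row of the database table
rows = [{"file": "a.wav", "Participant ID": "p01", "sex": "female", "Age": 34}]
print(rename_columns(rows, colnames))
```

Columns that are not mentioned in the mapping, like file here, keep their name.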

Nkululeko: automatically stratify your split sets

With nkululeko since version 0.68.0, the selection of test/dev vs. train samples can be done automatically in a stratified manner, i.e. trying to find splits that are age or gender balanced.
An example for such a configuration is this:

# the name of the database
databases = ['emodb']
# the location of the data
emodb = ./data/emodb/emodb
# set the split strategy to "balanced"
emodb.split_strategy = balanced
# set a percentage value for your test split
emodb.test_size = 20
# stratify variables with weights for importance
balance = {'emotion':2, 'age':1, 'gender':1}
# all stratification variables need to be categorical, 
# so we need to state the number of bins for "age" 
age_bins = 2
# a value for how much importance to give for the ideal group sizes
size_diff_weight = 1
# the target value of the experiment
target = emotion

Nkululeko will always keep the speaker variable disjoint, i.e. the resulting splits will contain different speakers.
With the example above, the algorithm will try to balance emotion, gender and (binned) age distributions across the splits.
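
The following brute-force sketch shows the kind of optimisation this involves: score every speaker-disjoint split by how much the category distributions differ between train and test, weighted as in balance, plus a penalty for missing the requested test size. It is not the algorithm nkululeko uses; all function names and the toy data are assumptions, age is assumed to be binned already, and test_size is given as a fraction rather than a percentage.

```python
from collections import Counter
from itertools import combinations


def distribution(rows, key):
    """Relative frequency of each category of `key` in the rows."""
    counts = Counter(row[key] for row in rows)
    return {cat: n / len(rows) for cat, n in counts.items()}


def imbalance(train, test, key):
    """Sum of absolute differences between the two category distributions."""
    p, q = distribution(train, key), distribution(test, key)
    return sum(abs(p.get(c, 0) - q.get(c, 0)) for c in set(p) | set(q))


def score(train, test, balance, test_size, size_diff_weight):
    """Lower is better: weighted imbalance plus a penalty for missing the size."""
    size_penalty = abs(len(test) / (len(train) + len(test)) - test_size)
    balance_penalty = sum(w * imbalance(train, test, k) for k, w in balance.items())
    return balance_penalty + size_diff_weight * size_penalty


def balanced_split(rows, balance, test_size=0.2, size_diff_weight=1):
    """Try every speaker-disjoint split and return the best-scoring test speakers."""
    speakers = sorted({row["speaker"] for row in rows})
    best_score, best_test = None, None
    for n in range(1, len(speakers)):
        for test_speakers in combinations(speakers, n):
            test = [r for r in rows if r["speaker"] in test_speakers]
            train = [r for r in rows if r["speaker"] not in test_speakers]
            s = score(train, test, balance, test_size, size_diff_weight)
            if best_score is None or s < best_score:
                best_score, best_test = s, set(test_speakers)
    return best_test


# made-up data: four speakers with two samples each
rows = [
    {"speaker": f"s{i}", "emotion": emo,
     "gender": "f" if i % 2 else "m", "age": "young" if i < 3 else "old"}
    for i in range(1, 5) for emo in ("happy", "sad")
]
test_speakers = balanced_split(rows, {"emotion": 2, "age": 1, "gender": 1})
print(test_speakers)
```

Trying all combinations is only feasible for a handful of speakers; it is used here to keep the scoring idea visible.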

Nkululeko: inspect your data with Spotlight

With nkululeko since version 0.67.0, the spotlight software is directly integrated as part of the EXPLORE module.

You can simply run your data filters, augmentations, machine learning experiments, segmentations and model predictions as usual, and then call the spotlight software by adding to your configuration file:

# all, train or test
sample_selection = all
spotlight = True

and running the EXPLORE module:

python -m nkululeko.explore --config myconfig.ini

Note that you might need to install an extra package:

pip install renumics-spotlight

A new web browser window should open as an interface to spotlight:

Nkululeko: generate a latex/pdf report

With nkululeko since version 0.66.3, a report document formatted in LaTeX and compiled as a PDF file can be generated automatically, basically as a compilation of the images that are generated.
There is a dedicated REPORT section in the config file for this. Here is an example:

# should the report be shown in the terminal at the end?
show = False 
# should a latex/pdf file be printed? if so, state the filename
latex = emodb_report
# name of the experiment author (default "anon")
author = Felix
# title of the report (default "report")
title = EmoDB

With each run of a nkululeko module in the same experiment environment, the details of the report will be added.
So a typical use would be to first run the general module and then more specialized ones:

# first run a segmentation 
python -m nkululeko.segment --config myconf.ini 
# then rename the data-file in the config.ini and
# run some data exploration
python -m nkululeko.explore --config myconf.ini 
# then run a machine learning experiment
python -m nkululeko.nkululeko --config myconf.ini 

Each run will add some content to the report.


If you use modules, feature extractors or models that rely on torchaudio with Nkululeko, e.g. the Resampler or the Squim model, you need to install the nightly version:

pip uninstall -y torch torchvision torchaudio
pip install --pre torch torchvision torchaudio --extra-index-url