Nkululeko: oversample the training set

Sometimes, with categorically labeled data, the number of samples per class is very unevenly distributed, misleading the model to think that the overwhelming majority class is more important than the others.
In this case, two techniques might help: class weighting assigns a higher weight to samples from minority classes, and oversampling "invents" new samples for the minority classes.
With nkululeko since version 0.70.0, you can oversample the trainig set with different algorithms implemented by the imb_learn package.
You simply state the method in the FEATS section like so:

balancing = adasyn # either ros, smote or adasyn

Three methods are available:

  • ros: simply repeat random samples from the minority classes
  • smote: "invent" new minority samples by little changes from the existing ones
  • adasyn: similar to smote, but resulting in uneven class distributions

Nkululeko: re-name data column names

With nkululeko since version 0.68.1, you can re-name data fields (columns n your data table) by setting the following in your ini-file:

databases = ['mydata']
mydata.colnames = {'speaker':'Participant ID', 'sex':'gender', 'Age': 'age'}

which means, that, before further processing, the Participant ID field in your database mydata will be treated as speaker label and so on.

Nkululeko: automatically stratify your split sets

With nkululeko since version 0.68.0, the selection of test/dev vs. train samples can be done automatically in a stratified manner, i.e. trying to find splits that are age or gender balanced.
An example for such a configuration is this:

# the name of the database
databases = ['emodb']
# the location of the data
emodb = ./data/emodb/emodb
# set the split strategy to "balanced"
emodb.split_strategy = balanced
# set a percentage value for your test split
emodb.test_size = 20
# stratify variables with weights for importance
balance = {'emotion':2, 'age':1, 'gender':1}
# all stratification variables need to be categorical, 
# so we need to state the number of bins for "age" 
age_bins = 2
# a value for how much importance to give for the ideal group sizes
size_diff_weight = 1
# the target value of the experiment
target = emotion

Nkululeko will always keep the speaker variable disjunct, i.e. resulting splits will contain different speakers.
With the example above, the algorithm will try to balance emotion, gender and (binned) age distributions across the splits.

Nkululeko: inspect your data with Spotlight

With nkululeko since version 0.67.0, the spotlight software is directly integrated as part of the EXPLORE module.

You can simply run your data filters, augmentations, machine learning experiments, segmentations and model predictions as usual, and then call the spotlight software by adding to your configuration file:

sample_selection = all # or train or test 
spotlight = True 

and running the EXPORE module

python -m nkululeko.explore --config myconfig.ini

Note that you might require to install an extra package:

pip install renumics-spotlight

A new web browser window should open as an interface to spotlight:

Nkululeko: generate a latex/pdf report

With nkululeko since version 0.66.3, a report document formatted in Latex and compiled as a PDF file can automatically be generated, basically as a compilation of the images that are generated.
There is a dedicated REPORT section in the config file for this, here is an example:

# should the report be shown in the terminal at the end?
show = False 
# should a latex/pdf file be printed? if so, state the filename
latex = emodb_report
# name of the experiment author (default "anon")
author = Felix
# title of the report (default "report")
title = EmoDB

with each run of a nkululeko module in the same experiment environment, the details of the report will be added.
So a typical use would be, to first run the general module and than more specialized ones:

# first run a segmentation 
python -m nkululeko.segment --config myconf.ini 
# then rename the data-file in the config.ini and
# run some data exploration
python -m nkululeko.explore --config myconf.ini 
# then run a machine learning experiment
python -m nkululeko.nkululeko --config myconf.ini 

Each run will add some contents to the report


If you use modules, feature-extractors or models that use torchaudio with Nkululeko, like e.g . Resampler or Squim model, you need to install the nightly version.

pip uninstall -y torch torchvision torchaudio
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Nkululeko: get some statistics on correlation and effect size

With nkululeko since version 0.64.0, some statistics are printed as part of the plot's titles.
With the explore module, you can plot correlations between the target (e.g. emotion or age) and other variables that are in the database, e.g. gender or duration, or everything you might have predicted with the predict module.
You need to differentiate if any of your variables is categorical/nominal (strings) or real valued (numbers).

If you plot the distribution of two categories, the Chi2 statistics is used to estimate if the correlation is significant. A p-value is given in the title, like e.g. in this plot:

If you plot the distribution of two real-valued variables, the correlation will be estimated by the Pearson's correlation coefficient:

If the target is categorical and the variable real-valued, we use Cohen's d to print out the maximal effect size of a category combination and the variable:

If the target is also real-valued, it will be binned (made categorical) per default:

If you want to prevent that, you can set a value in the configuration:

bin_reals = False

and you get a different plot:

How to fix different sampling rates in a dataset with Nkululeko

With nkululeko since version 0.62.0 you can automatically adjust the sampling rate to the standard of 16 kHz, which is required by most models that might need to process your data.

A special module can be configured in the configuration file like this:

# which of the data splits to re-sample: train, test or all (both)
sample_selection = all
replace = True
target = data_resampled.csv

and then you call it like this

python -m nkululeko.resample --config my_config.ini

WARNING: if replace = True, this changes (overwrites) ALL files in the splits, directly on your hard disk. Make sure to make a safety copy of your database before, in case the results are undesired, or you still need the data in other sample rates.

The default value though, is replace = False . Then, the target value will be used as filename for the new dataframe with filenames that indicate that the sampling rate has been changed.

As stated above, only files in the test and train splits are affected. This means that you can use all filtering, e.g. limit samples per speaker to 20 samples to pre-select samples.

Nkululeko: how to predict labels for your data from existing models and check them

With nkululeko since version 0.58.0, you can predict labels automatically for a given database, and then perhaps use these predictions to check on bias within your data.
One example:
You have a database labeled with smokers/non-smokers. You evaluate a machine learning model, check on the features and find to your astonishment, that the mean pitch is the most important feature to distinguish between smokers and non-smokers, with a very high accuracy.
You suspect foul-play and auto-label the data with a public model predicting biological sex (called gender in Nkululeko).
After a data exploration you see that most of the smokers are female and most of the non-smokers are male.
The machine learning model detected biological sex and not smoking behaviour.

How do you do this?
Firstly, you need to predict labels. In a configuration file, state the annotations you'd like to be added to your data like this:

databases = ['mydata']
mydata = ... # location of the data
mydata.split_strategy = random # not important for this 
# the label names that should be predicted: possible are: 'gender', 'age', 'snr', 'valence', 'arousal', 'dominance', 'pesq', 'mos'
targets = ['gender']
# the split selection, use "all" for all samples in the database
sample_selection = all

You can then call the predict module with python:

python -m nkululeko.predict --config my_config.ini

The resulting new database file in CSV format will appear in the experiment folder.
The newly predicted values will be named with a trailing _pred, e.g. "gender_pred" for "gender"
You can than configure the explore module to visualize the the correlation between the new labels and the original target:

databases = ['predicted']
predicted = ./my_exp/mydata_predicted.csv
predicted.type = csv
predicted.absolute_path = True
predicted.split_strategy = random
# which labels to investigate in context with target label
value_counts = [['gender_pred']]
# the split selection
sample_selection = all

and then call the explore module:

python -m nkululeko.explore --config my_config.ini

The resulting visualizations are in the image folder of the experiment folder.
Here is an example of the correlation between emotion and estimated PESQ (Perceptual Evaluation of Speech Quality)

The effect size is stated as Cohen's d, for categories that have the largest value, in this case the difference of estimated speech quality is largest between the categories neutral and angry.

Nkululeko: segmenting a database

Segmenting a database means to split the audio samples of a database into smaller segments or chunks. With speech data this is usually done on the basis of VAD, aka voice activity detection, meaning that the pauses between speech in the audio samples are used as segment borders.

The reason for segmenting could be to label the data with something that would not last over the whole sample, e.g. emotional state.
Another motivation to segment audio data might be that the acoustic features are targeted at a specific stretch of audio, e.g. 3-5 seconds long.

Within nkululeko this would be done with the segment module, which is currently based on the silero software.

You simply call your experiment configuration with the segment module, and the train, test set or both will be segmented.
The advantage is, that you can use all filters on your data that might make sense beforehand, for example with the android corpus, only the reading task samples are not segmented.
You can select them like so:

filter = [['task', 'reading']]

and then call the segment module:

python -m nkululeko.segment --config my_conf.ini

The output is a new database file in CSV format.

If you want, you can specify if only the training, or test split, or both should be segmented, as well as the string that is added to the name of the resulting csv file (the name per default consists of the database names):

# name postfix
target = _segmented
# which model to use
method = silero
# which split: train, test or all (both)
sample_selection = all
# the minimum lenght of rest-samples (in seconds)
min_length = 2
# the maximum length of segments, longer ones are cut here.  (in seconds)
max_length = 10