Category Archives: nkululeko

How to do cross validation with Nkululeko

Only for linear classifiers like XGB, SVM, SGR and SVR you have the possibility to disregard training and development splits and do a cross validation, i.e. validate one data set in a circular manner against itself.

The basic idea is that you take part of the data and evaluate against the rest, and in the next round take another part and so forth, until all data has been evaluated. Because the speaker identity is so strong in speech, this is done usually in a speaker exclusive manner, known under the term "leave one speaker out " (LOSO).

If you have too many speakers and/or each speaker really only one sample, you might want to split your speakers into groups and do a "leave one speaker group out" strategy (LOGO).

A related approach is known under the name k fold cross validation, where k usually equals 10.
When you only have one sample per speaker, this might make more sense.
So, how would you do that with Nkululeko?
First, you would define a training and development split for your data anyway, because Nkululeko is expecting it if there is only one database. You might set that to random, it's not used anyway:

[DATA]
mydata.split_strategy = random 

Then in your config file, you specify in the MODEL section either:

[MODEL] 
logo = 10 

to assign 10 groups to your speakers and then evaluate each group against all others.
If you want to do a leave-one-speaker_out experiment (LOSO), simply assign the number for logo the number of your speakers.

If there already is a fold column in your data, this will be used, otherwise Nkululeko will randomly assign folds to speakers.

Or you do

[MODEL] 
k_fold_cross = 5 

for instance to disregard speaker information and simply evaluate 5 times a fifth of the data against the rest.
We use stratified sets, i.e. the algorithm tries to balance the class data within each set.

Import speech data to nkululeko

Often you simply start an experiment with some audio data that you got from somewhere in no special format. Often the labels are encoded in the filenames.
If so, this Python script can help to convert the audio to a Nkululeko readable format and generate a CSV (comma separated values) file.

import os
from audeer import list_file_names
from os.path import basename

# folder with the original audio files (in wav format)
root = './orig_wav/'
# output folder, empty at the beginning
out_dir = './audio/'
# name of the output file list
out_file = 'data.csv'

# get a list of wav files
list = list_file_names(root, filetype = 'wav', basenames=True, recursive=True)
# write the list header (change to your data)
with open(out_file, 'a') as the_file:
    the_file.write('file,type\n')
# for each file
for file in list:
    # get the file name without path
    fn = basename(file)
    # convert to 16kHz sampling rate and mono channel 
    os.system(f'sox {root+file} -r 16000 -c 1 {out_dir+fn}')
    # extract the annotation label from the file name (change this to your needs)
    label = fn[0]
    # lastly: add file to list 
    with open(out_file, 'a') as the_file:
        the_file.write(f'{out_dir+fn},{label}\n')

The resulting data list can then be read by Nkululeko in the config file (using randomly 30 % of the data as development set):

[DATA]
my_data = /some_path/data.csv
my_data.type = csv
my_data.split_strategy = random
my_data.testsplit = 30

How to limit a dataset with Nkululeko

In some cases you don't want to use the whole dataset for training or test, but filter it in some way. There are several filter possibilities in nkuluoleko:

  • limit_samples: limit the number of samples, randomly selected
  • limit_samples_per_speaker: maximum number of samples per speaker (for leveling data where same speakers have a large number of samples)
  • min_duration_of_sample: limit the samples to a minimum length (in seconds)
  • max_duration_of_sample: limit the samples to a maximum length (in seconds)
  • filter: don't use all the data but only selected values from columns: [col, val].
    You can specify several filters in one: e.g.

    [DATA]
    filter = [['sex', 'female'], ['style', 'reading']]

    would use only the data where sex is female and style is reading

These can be specified per database:

[DATA]
databases = ['d1']
# force a specific feature to be present, e.g. gender labels ( when not all data has gender values)
d1.required = gender
# limit the absolute sample number
d1.limit_samples = 500
# limit the number of samples per speaker
d1.limit_samples_per_speaker = 20

Or for all samples, or the test and/or train splits

[DATA]
# only filter the training split: 
filter.sample_selection = train
# specify a minimum duration for train samples (in seconds)
min_duration_of_sample = 3.5
# use only samples where gender is female
filter = [['gender', 'female]]

Specifying database disk location with Nkululeko

Since version 0.13.0 with Nkululeko you can define all root folders for your databases at one single place.
This is very handy if you work in paralell on several computers, e.g. a development and a deployment environment.

In the [DATA] section of your ini file, you specify the path to the local data root folder file like this:

[DATA]
root_folders = data_roots.ini
databases = ['dataset_1']
...

and then within the data_roots.ini file (you can actually call it what you want), you declare the folders to your databases like this:

[DATA]
dataset_1 = /mypath/d1/
dataset_1.files_tables = ['files']
dataset_2 = ./d2
...

you can add all your data set options that you need in this file:

[DATA]
emodb = /mypath/d1/
emodb.split_strategy = speaker_split
emodb.testsplit = 40
emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'fear':'fright.', 'neutral':'neutral'}
dataset_2 = ./d2
dataset_2.files_tables = ['files_test', 'files_train']

If you define those fields in your experiment ini file, it will have precedence.

Nkululeko: How to import a database

Nkululeko is a tool to ease machine learning on speech databases.
This tutorial should help you to import databases.
There are two formats upported:
1) csv (comma seperated values)
2) audformat

CSV format

The easiest is CSV, you simply create a table with the following informations:

  • file: the path to the audio file
  • task: is the speaker characteristics value that you want to explore, e.g. age or emotion, or both

and then fill it with values of your database. Optionally, your data can contain any amount of additional information in further columns. Some naming conventions are pre-defined:

  • speaker: speaker id, a string being unique for samples from one speaker
  • gender: biological sex
  • age: an integer between 0 and 100 denoting the age in years.

So a file for emotion might look like this

file, speaker, gender, emotion
<path to>/s12343.wav, s1, female, happy
...

You can then specify the data in your initialization file like this:

[DATA]
databases = ['my_db']
my_db.type = csv
my_db = <path to>/my_data_file.csv
my_db.absolute_path = False 
...
target = emotion

You should set the flag absolute_path depending on whether

  • the file paths start from the location of where you run Nkululeko (or start from root: /), then True
  • or they start from the location where the data resides, then False

(if in doubt, just try it out: there should be an error message that the audio files don't exist)

You can not specify split tables with this format, but would have to simply split the file in several databases.

There is an example on how to import the ravdess database here.

And this would be an example ini file to use it:

[EXP]
root = ./tests/results/
name = exp_ravdess
runs = 1
epochs = 1
save = True
[DATA]
databases = ['train', 'test', 'dev']
train = ../nkululeko/data/ravdess/ravdess_train.csv
train.type = csv
train.absolute_path = False
train.split_strategy = train
dev = ../nkululeko/data/ravdess/ravdess_dev.csv
dev.type = csv
dev.absolute_path = False
dev.split_strategy = train
test = ../nkululeko/data/ravdess/ravdess_test.csv
test.type = csv
test.absolute_path = False
test.split_strategy = test
target = emotion
labels = ['angry', 'happy', 'neutral', 'sad']
[FEATS]
type = ['os']
scale = standard
[MODEL]
type = xgb

I.e. the splits train and dev get concatenated to a common train set

Fun fact: the result is:

audformat

audformat allows for many usecases, so the specification might be more complex.
So in the easiest case you have a database with two tables, one called files that contains the speaker informations (id and sex) and one called like your task (aka target), so for example age or emotion.
That's the case for our demo example, the Berlin EmoDB, ando so you can include it simply with.

[DATA]
databases = ['emodb']
emodb = /<path to>/emodb/
target = emotion
...

But if there are more tables and they have special names, you can specifiy them like this:

[DATA]
databases = ['msp']
# path to data
msp = /<path to>/msppodcast/
# tables with speaker information
msp.files_tables =  ['files.test-1', 'files.train']
# tables with task labels
msp.target_tables =  ['emotion.test-1', 'emotion.train']
# train and evaluation splits will be provided
msp.split_strategy = specified
# here are the test/evaluatoin split tables
msp.test_tables = ['emotion.test-1']
# here are the training tables
msp.train_tables = ['emotion.train']
target = emotion

Nkululeko: classifying continuous variables

Nkululeko supports classification and regression.
Classification means predicting a class (or category) from data, regression predicting a continuous value, as for example the speaker age in years.

If you want to use classification with continuous variables, you need to first bin it, which means that you put the values into pre-defined bins. To stay with our age example, you'd declare everyone above 50 years as old and all other as young.

This post shows you how to do that with Nkululeko by setting up your .ini file.

You set up the experiment as classification type:

[EXP]
...
type = classification

But declare the data to be continuous:

[DATA]
...
type = continuous
labels = ['u40', '40ies', '50ies', '60ies', 'ΓΌ70']
bins  = [-1000,  40, 50, 60, 70, 1000]

Then the data will be binned according to the sepecified bins and labeled accordingly.
You need (number of labels) + 1 values for the bins, as they are given lower and upper limit. It makes sense to set the lower and upper absolute limits extreme as you don't know what the classifier will predict.

How to soft-label a database with Nkululeko

Soft-labeling means to annotate data with labels that were predicited by a machine classifier.
As they were not evaluated by a human, you might call them "soft".

Two steps are necessary:
1) save a test/evaluation set as a new database
2) load this new database in a new experiment as training data

Within nkululeko, you would do it like this:

step 1: save a new database

You simply specifify a name in the EXP section to save the test predictions like this:

[EXP]
...
save_test = ./my_test_predictions.csv

You need Model save to be turned on, because it will look for the best performing model:

[MODEL]
...
save = True

This will store the new database to a file called my_test_predictions.csv in the folder where the python was called.

step 2: load as training data

Here is an example configuration how to load this data as additional training (in this case in addition to emodb):

[EXP]
root = ./tests/
name = exp_add_learned
runs = 1
epochs = 1
[DATA]
strategy = cross_data
databases = ['emodb', 'learned', 'test_db']
trains = ['emodb', 'learned']
tests = ['test_db']
emodb = /path-to-emodb/
emodb.split_strategy = speaker_split
emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
test_db = /path to test database/
test_db.mapping = <if any mapping to the target categories is needed>
learned = ./my_test_predictions.csv
learned.type = csv
target = emotion
labels = ['angry', 'happy', 'neutral', 'sad']
[FEATS]
type = os
[MODEL]
type = xgb
save = True
[PLOT]

Nkululeko: try out / demo a trained model

This is another Nkululeko post that shows you how to demo model that you trained before.

First you need to train a model, e.g. on emodb as shown here
In the ini file you MUST set the parameters

[EXP]
...
save = True

in the general section and

[MODEL]
...
save = True

in the MODEL section of the configuration file.
Background: we need both the experiment as well as all model files to be saved on disk.

Here is an example script then how to call the demo mode:

python -m nkululeko.demo --config exp_emodb.ini

, if your config file is called exp_emob.ini
Automatically the best performing model will be used.
This will start recording for 3 seconds from your microphone.
if you specify

python -m nkululeko.demo --config exp_emodb.ini --file test.wav

the file test.wav will be predicted, it needs to be in 16 kHz sampling rate and mono channel.
If you specify

python -m nkululeko.demo --config exp_emodb.ini --list my_list.txt --folder data/ravdess/ --outfile my_results.csv

The file my_list.txt will be read and it is expected to contain one file to be predicted per line, e.g.:

tests/a.wav
tests/b.wav

The optional argument --folder can be used to specify a parent folder for the input files.

The optional argument --outfile can be used to save the results in a CSV table.

If --list my_file.csv is in CSV format, audformat will be interpreted, but only the index will be used.

E.g. with

file,emotion
tests/a.wav,happy
tests/b.wav,angry

only the file column will be used.

How to set up wav2vec embedding for nkululeko

Since version 0.10, nkululeko supports facebook's wav2vec 2.0 embeddings as acoustic features.
This post shows you how to set this up.

set up nkululeko

in your nkululeko configuration (*.ini) file, set the feature extractor as wav2vec2 and denote the path to the model like this:

[FEATS]
type = ['wav2vec2']
wav2vec.model = /my path to the huggingface model/

Alternatively you can state the huggingface model name directly:

[FEATS]
type = ['wav2vec2-base-960h']

Out of the box, as embeddings the last hidden layer is used. But the original wav2vec2 model consists of 7 CNN layers followed by up to 24 transformer layers. if you like to use an earlier layer than the last one, you can simply count down-

[FEATS]
type = wav2vec2
wav2vec.layer = 12

This would use the 12th layer of a 24 layer model and only the4 CNN layers of a 12 layer model.