Category Archives: nkululeko

Nkululeko: show feature importance

Since version 0.40, Nkululeko can now show the best performing X acoustic features according to some model.

There is a new section call EXPL (short for exploration), and you could state

[EXPL]
feature_distributions = True
model = ['tree', 'gmm', 'svm']
permutation = True
max_feats = 5
sample_selection = test

in your config file, and then run the exploration module like this:

python -m nkululeko.explore --config my_config.ini

The following values can be set in the config:

  • feature_distributions: use this analysis.
  • sample_selection: Which sample set/split to use, either all, train or test
  • model: Which models to use to estimate feature importance. Can be all models from the MODEL section, If several are stated, the mean result is used.
  • max_feats: The best n feats are shown
  • permutation: use feature permutation to determine the best features. Make sure to test the models before.

The resulting list will then appear in the result folder and a barplot image in the image folder.

Afterwards you could inspect single features as described here

Nkululeko: how to plot distributions of feature values

As shown in this post, with Nkululeko you can select only specific features from your features sets by specifying them in the [FEAT] section:

[FEATS]
features = ['JitterPCA', 'meanF0Hz', 'hld_sylRate']

What you can also do, is plotting them per category (only for classification), by specifying in the PLOT section if you would like that for all samples or only test or train samples:

[EXPL]
# turn it on
feature_distributions = True 
# use only training samples
sample_selection = train 
# only plot the 5 most important features 
max_feats = 5  

You would have to call nkululeko with the explore interface:

python -m nkululeko.explore --config <myConfig.ini>

The image file is in the image folder and should look similar to this:

Nkululeko: how to predict many samples

There are three ways to predict a number of samples:

  1. If you want to save the predictions of an experiment for later use, you can do so by stating in the EXP section

    [EXP]
    save_test = ./my_saved_test_predictions.csv

    The output format is CSV, comma seperated values.

  2. Alternatively, you can test an existing database against the best model you trained before, by stating the databases as tests in the DATA section:

    [DATA]
    tests = ['my_testdb']
    my_testdb = /mypath/my_testdb
    ...

    and then calling Nkululeko's test module

    python -m nkululeko.test --config mycoonfg.ini --outfile myresults.csv
  3. Run the demo module simply for a set of files:

    python -m nkululeko.demo --config mycoonfg.ini --list my_filelist.txt

Nkululeko

This is the entry post for Nkululeko: a framework to do machine learning experiments on audio data based on configuration files.

Here's an overview on the tutorials:

How to import acoustic features with the Nkululeko software

With nkululeko you usually compute acoustic features with some software, but you can also simply import them from a text file.

You simply use the import as feature extraction type, and specify the file locations (can of course be only one) like this:

[FEATS]
type = ['import']
import_file = ['tests/results/exp_emodb_praat/store/emodb_praat_test.csv', 'tests/results/exp_emodb_praat/store/emodb_praat_train.csv']

The import file(s) must be in CSV (comma separated values) format and contain either a file index (a column with file names) or an audformat segmented index.

The index must match with the one from your data, else you will get the error:

Imported features for data set XXX not found!

You can combine with other features, e.g.

[FEATS]
type = ['import', 'os']

and select features, just like with any other feature sets.

Nkululeko: How to evaluate a test set with a given best model

Nululeko has two modules for testing and unknown data set, despite train and development/evaluation set.

Let's recap the concept of train/dev/test splits:

  • train is used to train a supervised model
  • dev is a set to evaluate this model, i.e. know when it is a good model (that doesn't overfit)
  • test is a set to be used ONLY once: for the real use of the model. If you would use the test as a dev set, you can't be sure if you're not overfitting again (because you used the dev set to adjust the meta parameters of your model).

So, in order to evaluate a third dataset ( beneath train and dev) you might have situations:
a) you have a labeled test set and want to evaluate it
b) you have an unknown test set (no labels) and want to add predictions (without evaluation)

For a),
you can use the test module, and set a tests entry in the configuration [DATA] section like so:

[DATA]
tests = ['my_testdb']
my_testdb = /mypath/my_testdb
my_testdb.split_strategy = test
...

and then call Nkululeko's test module

python -m nkululeko.test --config mycoonfg.ini --outfile myresults.csv

For b),
you can use the demo module and state your test set as a list of files like so:

python -m nkululeko.demo --config my_config.ini --list my_testsamples.csv --outfile my_results.csv

In order to use a model, of course you do need to have it trained and saved before. So you need a run with the nkululeko module before.

python -m nkululeko.nkululeko --config my_config.ini

with my_config,ini containing:

[EXP]
save = True
[MODEL]
save = True

Nkululeko FAQ

This post is the start of a troubleshooting /FAQ list

The number of speakers in my test/train splits seems wrong

  • Did you check if perhaps the speakers from several datasets are the same? If so, you can add a dataname.rename_speakers = True to the configuration. This will prepend the dataset name to the speakers

How to combine feature sets with Nkululeko

If you want to use combine several acoustic parameter (feature) sets with nkululeko, you might state

[FEATS]
type = ['mld', 'praat']
features = ['JitterPCA', 'meanF0Hz', 'hld_sylRate']

This would combine the

  • hld_sylRate feature from MLD
  • JitterPCA feature from Feinberg's Praat features and
  • meansF0Hz feature from Feinberg's Praat features

Of course you could omit the features entry and simply use all of them.

It's interesting to see how many emotions from Berlin Emodb can still be recognized with only these three parameters:

How to use selected features from Praat with Nkululeko

If you want to use acoustic parameters extracted by the wonderful Praat software with nkululeko, you state

[FEATS]
type=['praat']

in the feature section of your config file.
If you like to use only some features of all the ones that are extracted by David R. Feinberg's Praat scripts, you can look at the output and select some of them in the FEAT section, e.g.

type = ['praat']
praat.features = ['speechrate(nsyll / dur)']

You can do the same with opensmile features:

type = ['os']
os.features = ['F0semitoneFrom27.5Hz_sma3nz_amean']

or even combine them

type = ['praat', 'os']
praat.features = ['speechrate(nsyll / dur)']
os.features = ['F0semitoneFrom27.5Hz_sma3nz_amean']

this is actually the same as

type = ['praat', 'os']
features = ['speechrate(nsyll / dur)', 'F0semitoneFrom27.5Hz_sma3nz_amean']

if you would want to combine all of opensmile eGeMAPS features with selected Praat features, you would do:

type = ['praat', 'os']
praat.features = ['speechrate(nsyll / dur)']

It is interesting to see, how many emotions of Berlin EmoDB still get recognized with only mean F0 and Jitter as features:

image

What kind of features are there, you might ask yoursel?
Here's a list:
'duration', 'meanF0Hz', 'stdevF0Hz', 'HNR', 'localJitter',
'localabsoluteJitter', 'rapJitter', 'ppq5Jitter', 'ddpJitter',
'localShimmer', 'localdbShimmer', 'apq3Shimmer', 'apq5Shimmer',
'apq11Shimmer', 'ddaShimmer', 'f1_mean', 'f2_mean', 'f3_mean',
'f4_mean', 'f1_median', 'f2_median', 'f3_median', 'f4_median',
'JitterPCA', 'ShimmerPCA', 'pF', 'fdisp', 'avgFormant', 'mff',
'fitch_vtl', 'delta_f', 'vtl_delta_f''

How to test a trained model on a new test set with Nkululeko

Sometimes you might want to test your already trained model(s) on a new dataset, e.g. because the training took a lot of resources.
If you stored your models during the training this is possible.

[DATA]
databases = ['emodb']
....
[MODEL]
save = True

In a new config file for your experiment that uses a dufferent test set, you set

[DATA]
databases = ['emodb', 'polish']
trains = ['emodb']
tests = ['polish']
strategy = cross_data....
[MODEL]
only_test = True

In the example above, emodb has been used as the training database, and polish in a second experiment later as a test database.