
Nkululeko: oversample the training set

Sometimes, with categorically labeled data, the number of samples per class is very unevenly distributed, which can mislead the model into treating the overwhelming majority class as more important than the others.
In this case, two techniques might help: class weighting, which assigns a higher weight to samples from minority classes, and oversampling, which "invents" new samples for the minority classes.
With nkululeko since version 0.70.0, you can oversample the training set with different algorithms implemented by the imbalanced-learn (imblearn) package.
You simply state the method in the FEATS section like so:

[FEATS]
...
balancing = adasyn # either ros, smote or adasyn

Three methods are available (see the sketch after this list):

  • ros: simply repeat random samples from the minority classes
  • smote: "invent" new minority samples by interpolating between existing ones
  • adasyn: similar to smote, but resulting in only approximately even class distributions
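
To illustrate what these methods do, here is a minimal standalone sketch using imbalanced-learn directly (this is not Nkululeko's internal code, and the feature matrix is made up):

import numpy as np
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

# toy data: 100 samples of class 0, but only 10 of class 1
X = np.random.rand(110, 4)
y = np.array([0] * 100 + [1] * 10)

for sampler in (RandomOverSampler(), SMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_resample(X, y)
    # ros and smote balance the classes exactly, adasyn only approximately
    print(type(sampler).__name__, np.bincount(y_res))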

Nkululeko: rename data columns

With nkululeko since version 0.68.1, you can rename data fields (columns in your data table) by setting the following in your ini-file:

[DATA]
databases = ['mydata']
mydata.colnames = {'speaker':'Participant ID', 'sex':'gender', 'Age': 'age'}

which means that, before further processing, the Participant ID column in your database mydata will be treated as the speaker label, and so on.
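
Conceptually, this is the same as renaming columns in a pandas data frame; a sketch of the idea (not Nkululeko's actual code):

import pandas as pd

df = pd.DataFrame({'Participant ID': ['s1', 's2'], 'gender': ['f', 'm']})
# map the database's column names to the names Nkululeko expects
df = df.rename(columns={'Participant ID': 'speaker', 'gender': 'sex'})
print(df.columns.tolist())  # ['speaker', 'sex']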

Nkululeko: automatically stratify your split sets

With nkululeko since version 0.68.0, the selection of test/dev vs. train samples can be done automatically in a stratified manner, i.e. trying to find splits that are, for example, age- or gender-balanced.
An example for such a configuration is this:

[DATA]
# the name of the database
databases = ['emodb']
# the location of the data
emodb = ./data/emodb/emodb
# set the split strategy to "balanced"
emodb.split_strategy = balanced
# set a percentage value for your test split
emodb.test_size = 20
# stratify variables with weights for importance
balance = {'emotion':2, 'age':1, 'gender':1}
# all stratification variables need to be categorical, 
# so we need to state the number of bins for "age" 
age_bins = 2
# a value for how much importance to give for the ideal group sizes
size_diff_weight = 1
# the target value of the experiment
target = emotion

Nkululeko will always keep the speaker variable disjoint, i.e. the resulting splits will contain different speakers.
With the example above, the algorithm will try to balance emotion, gender and (binned) age distributions across the splits.
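
Nkululeko's balanced splitter additionally optimizes the weighted stratification targets stated above, but the speaker-disjoint core of the idea can be illustrated with scikit-learn's GroupShuffleSplit (a conceptual sketch, not Nkululeko's implementation):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# made-up samples with one speaker id per sample
speakers = np.array(['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])
X = np.arange(len(speakers)).reshape(-1, 1)

# a 20% test split that never shares a speaker between train and test
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=speakers))
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])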

Nkululeko: inspect your data with Spotlight

With nkululeko since version 0.67.0, the Spotlight software is directly integrated as part of the EXPLORE module.

You can simply run your data filters, augmentations, machine learning experiments, segmentations and model predictions as usual, and then call Spotlight by adding this to your configuration file:

[EXPL]
sample_selection = all # or train or test 
spotlight = True 

and running the EXPLORE module:

python -m nkululeko.explore --config myconfig.ini

Note that you might need to install an extra package:

pip install renumics-spotlight

A new web browser window should open as an interface to Spotlight.
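
If you want to point Spotlight at a data frame yourself, the underlying call looks roughly like this (a sketch assuming the renumics-spotlight package; the CSV path is hypothetical):

import pandas as pd
from renumics import spotlight

df = pd.read_csv('my_exp/results.csv')  # hypothetical result file
spotlight.show(df)  # opens the browser interface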

Torchaudio

If you use modules, feature extractors or models that rely on torchaudio with Nkululeko, e.g. the Resampler or the Squim model, you need to install the nightly version:

pip uninstall -y torch torchvision torchaudio
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Nkululeko: get some statistics on correlation and effect size

With nkululeko since version 0.64.0, some statistics are printed as part of the plot's titles.
With the explore module, you can plot correlations between the target (e.g. emotion or age) and other variables that are in the database, e.g. gender or duration, or anything you might have predicted with the predict module.
You need to differentiate whether your variables are categorical/nominal (strings) or real-valued (numbers).

If you plot the distribution of two categorical variables, the Chi² statistic is used to estimate whether the association is significant. A p-value is given in the title, as in this plot:

If you plot the distribution of two real-valued variables, the correlation will be estimated with Pearson's correlation coefficient:

If the target is categorical and the variable real-valued, we use Cohen's d and report the maximal effect size over all pairs of categories:

If the target is also real-valued, it will be binned (made categorical) by default:

If you want to prevent that, you can set a value in the configuration:

[EXPL]
bin_reals = False

and you get a different plot:
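
The statistics behind these plot titles are standard; a minimal sketch with scipy and numpy (not Nkululeko's internal code), using made-up data:

import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Chi2 for two categorical variables: p-value from a contingency table
table = np.array([[30, 10], [12, 28]])  # made-up counts
chi2, p, dof, expected = chi2_contingency(table)

# Pearson's correlation coefficient for two real-valued variables
x, y = np.random.rand(100), np.random.rand(100)
r, p_r = pearsonr(x, y)

# Cohen's d for a real-valued variable across two categories
a, b = np.random.rand(50) + 0.3, np.random.rand(50)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd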

Nkululeko: how to predict labels for your data from existing models and check them

With nkululeko since version 0.58.0, you can predict labels automatically for a given database, and then perhaps use these predictions to check on bias within your data.
One example:
You have a database labeled with smokers/non-smokers. You evaluate a machine learning model, inspect the features, and find to your astonishment that mean pitch is the most important feature for distinguishing smokers from non-smokers, with very high accuracy.
You suspect foul play and auto-label the data with a public model predicting biological sex (called gender in Nkululeko).
After a data exploration you see that most of the smokers are female and most of the non-smokers are male.
The machine learning model detected biological sex and not smoking behaviour.

How do you do this?
Firstly, you need to predict labels. In a configuration file, state the annotations you'd like to be added to your data like this:

[DATA]
databases = ['mydata']
mydata = ... # location of the data
mydata.split_strategy = random # not important for this 
...
[PREDICT]
# the label names that should be predicted; possible values are: 'gender', 'age', 'snr', 'valence', 'arousal', 'dominance', 'pesq', 'mos'
targets = ['gender']
# the split selection, use "all" for all samples in the database
sample_selection = all

You can then call the predict module with python:

python -m nkululeko.predict --config my_config.ini

The resulting new database file in CSV format will appear in the experiment folder.
The newly predicted values will be named with a trailing _pred, e.g. "gender_pred" for "gender".
You can then configure the explore module to visualize the correlation between the new labels and the original target:

[DATA]
databases = ['predicted']
predicted = ./my_exp/mydata_predicted.csv
predicted.type = csv
predicted.absolute_path = True
predicted.split_strategy = random
...
[EXPL]
# which labels to investigate in context with target label
value_counts = [['gender_pred']]
# the split selection
sample_selection = all

and then call the explore module:

python -m nkululeko.explore --config my_config.ini

The resulting visualizations are in the image folder of the experiment folder.
Here is an example of the correlation between emotion and estimated PESQ (Perceptual Evaluation of Speech Quality):

The effect size is stated as Cohen's d for the pair of categories with the largest value; in this case, the difference in estimated speech quality is largest between the categories neutral and angry.

Nkululeko: visualize clusters of your acoustic features

It can be very interesting to reduce the dimensionality of your acoustic or learned features to two or three dimensions and then color the individual samples with the label.

Nkululeko supports three different ways to reduce the dimensionality, illustrated in the sketch after this list:

  • pca: Principal Component Analysis
  • tsne: t-distributed stochastic neighbor embedding
    • perplexity=30, learning_rate=200
  • umap: Uniform Manifold Approximation and Projection
    • n_neighbors=10, random_state=0
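
These correspond to standard implementations; a self-contained sketch with scikit-learn and umap-learn, using the parameters listed above (independent of Nkululeko):

import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(200, 64)  # made-up feature matrix

proj_pca = PCA(n_components=2).fit_transform(X)
proj_tsne = TSNE(perplexity=30, learning_rate=200).fit_transform(X)
proj_umap = umap.UMAP(n_neighbors=10, random_state=0).fit_transform(X)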

To do this, you simply state your data and features as usual. The approaches you want to use can be set in the scatter field of the EXPL section:

[EXPL]
scatter = ['umap', 'tsne', 'pca']

(of course you don't have to use all of them) and then call the explore module:

python -m nkululeko.explore --config myconfig.ini

You can do this for all columns in your data, not only the target value.
If you want a scatter plot for a different target, state it like this (example):

[EXPL]
scatter = ['pca']
scatter.target = ['gender', 'age', 'likability']

The images appear in the image folder of your experiment and might look like this (all from the same data):

PCA

T-SNE

UMAP

The emotion cube

There is a multitude of ways to model emotions, and some of them are collected in the EmotionML vocabularies.
Two approaches are especially popular with engineers and non-psychologists:

  • discrete categories like anger, sadness, fear or joy, often associated with an intensity.
  • continuous dimensions like valence/pleasure, arousal or dominance

The emotion cube maps the emotional categories to a three-dimensional space.
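
As an illustration, such a mapping could look like the following; the coordinates here are made up for this sketch, and the exact placement varies between models:

# illustrative valence/arousal/dominance corners, each in [-1, 1]
emotion_cube = {
    #          (valence, arousal, dominance)
    'anger':   (-1,  1,  1),
    'fear':    (-1,  1, -1),
    'sadness': (-1, -1, -1),
    'joy':     ( 1,  1,  1),
}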

How to normalize features

"Normalizing" or scaling feature values means to shift them to a common range, or distribution with same mean and standard deviation (also called z-transformation).
You would do that for several reasons:

  • Artificial neural nets handle small numbers best, so all values should be in the range [-1, 1].
  • Speakers have their individual ways of speaking, which you are not interested in if you want to learn a general task, e.g. emotion or age. So you would speaker-normalize the values for each speaker individually. Of course, this is not possible in most applications, because you don't already have samples of your test speakers.
  • You might want to normalize the sexes, because women typically have a higher pitch. Another way out is to use only relative values instead of absolute ones.

Mind that you shouldn't use your test set for normalization, as it should really only be used for testing and is supposed to be unknown. That's why you should compute your normalization parameters on the training set; you can then use them to normalize/scale the test set.
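
With scikit-learn, this train/test separation looks like the following sketch (the feature matrices are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train, X_test = np.random.rand(80, 10), np.random.rand(20, 10)

scaler = StandardScaler()
# learn mean and standard deviation from the training features only
X_train_scaled = scaler.fit_transform(X_train)
# apply the same parameters to the (supposedly unknown) test features
X_test_scaled = scaler.transform(X_test)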