How to normalize features

"Normalizing", or scaling, feature values means shifting them to a common range, or to a distribution with the same mean and standard deviation (the latter is also called z-transformation).
You would do that for several reasons:

  • Artificial neural nets handle small numbers best, so all values should lie in the range [-1, 1].
  • Speakers have individual ways of speaking, which you are not interested in if you want to learn a general task, e.g. emotion or age detection. So you would normalize the values for each speaker individually. In most applications this is not possible, though, because you don't have samples of your test speakers in advance.
  • You might want to normalize across the sexes, because women typically have a higher pitch. An alternative is to use only relative values instead of absolute ones.

Mind that you shouldn't use your test set to compute the normalization, as it should only be used for testing and is supposed to be unknown. That's why you should compute your normalization parameters on the training set; you can then use them to normalize/scale the test set.
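
As a minimal sketch of this rule (not Nkululeko's own implementation), the following Python snippet computes the z-transformation parameters on the training values only and then reuses them for the test values. The function names and the pitch values are made up for illustration.

```python
import statistics


def fit_scaler(train_values):
    """Compute mean and standard deviation on the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # a constant feature would otherwise cause a division by zero
    return mean, std


def apply_scaler(values, mean, std):
    """Scale values with parameters that were computed elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical mean pitch values in Hz, one per sample
train_pitch = [110.0, 120.0, 130.0, 210.0, 230.0]
test_pitch = [100.0, 220.0]

mean, std = fit_scaler(train_pitch)  # the test set is not involved here
train_scaled = apply_scaler(train_pitch, mean, std)
test_scaled = apply_scaler(test_pitch, mean, std)  # reuse the train parameters

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values usually do not, and that is expected.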

Augmenting data

Often (kind of always) there is a lack of training data for supervised learning.

One way to tackle this is representation learning, which can be done in a self-supervised fashion.

Another approach is to multiply your labeled training data by adding slightly altered versions of it that do not change the information you want to detect, for example by adding noise to the data or clipping it. This is called augmentation, and here is a post on how to do this with nkululeko.
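
To illustrate the idea, here is a small Python sketch that creates two altered copies of a signal, one with added noise and one clipped, and keeps the original label for both. It is only a toy example under simple assumptions: the signal is a plain list of floats, and the function names, the noise level and the clipping limit are hypothetical.

```python
import math
import random


def add_noise(signal, noise_std=0.01, seed=None):
    """Return a copy of the signal with Gaussian noise added to each value."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in signal]


def clip(signal, limit=0.5):
    """Return a copy of the signal with all values limited to [-limit, limit]."""
    return [max(-limit, min(limit, s)) for s in signal]


# Hypothetical training sample: a short sine wave with an emotion label
signal = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
label = "happy"

# The altered versions keep the label, because the alteration
# is assumed not to change the information to be detected.
train = [
    (signal, label),
    (add_noise(signal, seed=0), label),
    (clip(signal), label),
]
print(len(train), "training samples from 1 original")
```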

A third way is to synthesize data based on the labeled training data, for example with GANs, VAEs or with rule-based simulation. Here one can distinguish whether only a parameterized form of the samples (i.e. the features) or whole audio files are generated.

Sometimes additional samples are needed only for a rare class. In this case, techniques like random oversampling (ROS), the Synthetic Minority Oversampling Technique (SMOTE) or Adaptive Synthetic sampling (ADASYN) can be used.
Here is a post on how to do this with nkululeko.
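
The simplest of these techniques, random oversampling, can be sketched in a few lines of Python: samples of the rarer classes are duplicated at random until every class is as frequent as the largest one. This is a toy illustration, not the code Nkululeko uses; the function name and the data are hypothetical. SMOTE and ADASYN go further and interpolate new feature vectors instead of copying existing ones.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the rarer classes
    until all classes have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical training set: sample ids with an unbalanced label distribution
samples = list(range(8))
labels = ["neutral"] * 6 + ["anger"] * 2

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Like any augmentation, this would normally be applied to the training set only.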


This is the entry post for Nkululeko: a framework to do machine learning experiments on audio data based on configuration files.

Here's an overview of the tutorials:

Meta parameter tuning

The parameters that configure machine learning algorithms are called meta parameters (or hyperparameters), in contrast to the "normal" parameters that are learned during training.

But as they obviously also influence the quality of your predictions, these parameters must be learned as well.

Examples are

  • the C parameter for SVM
  • the subsample ratio for XGB (XGBoost)
  • the number of layers and neurons for a neural net

The naive approach is simply to try them all; how to do this with Nkululeko is described here.

But in general, because the search space for the optimal configuration is usually unbounded, it is better to try a stochastic approach or a genetic one.
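
The difference between the two approaches can be sketched in Python as follows. The snippet is a toy under clearly stated assumptions: `dev_score` is a made-up stand-in for "train a model with these meta parameters and evaluate it on the dev set", and the parameter names and value ranges are hypothetical.

```python
import itertools
import math
import random


def dev_score(c, subsample):
    """Hypothetical stand-in for training a model and scoring it on dev.
    By construction, the score is highest at c=10 and subsample=0.8."""
    return -((math.log10(c) - 1.0) ** 2) - (subsample - 0.8) ** 2


def grid_search(c_values, subsample_values):
    """Naive approach: try every combination of the listed values."""
    grid = itertools.product(c_values, subsample_values)
    return max(grid, key=lambda params: dev_score(*params))


def random_search(n_trials, seed=0):
    """Stochastic approach: draw a fixed number of random configurations."""
    rng = random.Random(seed)
    candidates = [
        (10 ** rng.uniform(-2, 3), rng.uniform(0.5, 1.0))  # C on a log scale
        for _ in range(n_trials)
    ]
    return max(candidates, key=lambda params: dev_score(*params))


best_grid = grid_search([0.1, 1, 10, 100], [0.5, 0.75, 1.0])
best_random = random_search(n_trials=50)
print("grid search:  ", best_grid)
print("random search:", best_random)
```

The grid can only return values that were listed in advance, while the random search samples from continuous ranges with a fixed budget of trials.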

How to split your data

In supervised machine learning, you usually need three kinds of data sets:

  • train data: to teach the model the relation between data and labels
  • dev data: (short for development) to tune meta parameters of your model, e.g. number of neurons, batch size or learning rate.
  • test data: to evaluate your model ONCE at the end to check on generalization

Of course all this is to prevent overfitting on your train and/or dev data.

If you've used your test data for a while, you might need to find a new set, as chances are high that you overfitted on your test set during experiments.

So what's a good split?

Some rules apply:

  • train and dev can be from the same set, but the test set is ideally from a different database.
  • if you don't have that much data, a 60/20/20 % split (train/dev/test) is normal
  • if you have masses of data, use only so much dev and test data that your population seems covered.
  • if you have really little data: use x-fold cross-validation for train and dev; the test set should still be separate
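
As a sketch of how such a split might be drawn, the following Python snippet assigns whole speakers to train, dev and test, so that no speaker appears in more than one set. Splitting by speaker is an assumption that fits speech data; the function name, the fractions and the speaker ids are hypothetical, and this is not Nkululeko's implementation.

```python
import random


def split_by_speaker(speakers, dev_frac=0.2, test_frac=0.2, seed=0):
    """Return sample indices for train/dev/test with disjoint speakers.
    `speakers` holds one speaker id per sample."""
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_frac))
    n_dev = max(1, round(len(unique) * dev_frac))
    test_speakers = set(unique[:n_test])
    dev_speakers = set(unique[n_test:n_test + n_dev])
    split = {"train": [], "dev": [], "test": []}
    for index, speaker in enumerate(speakers):
        if speaker in test_speakers:
            split["test"].append(index)
        elif speaker in dev_speakers:
            split["dev"].append(index)
        else:
            split["train"].append(index)
    return split


# Hypothetical data set: 10 speakers with 3 samples each
speakers = [f"spk{i:02d}" for i in range(10) for _ in range(3)]
split = split_by_speaker(speakers)
print({name: len(indices) for name, indices in split.items()})
```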

Nkululeko exercise

Edit the demo configuration

Set/keep emotion as the target, os as the FEAT type and xgb as the MODEL type.

Use emodb as both the train and the test set, but try out all split methods:

  • specified
  • speaker split
  • random
  • loso
  • logo
  • 5_fold_cross_validation

Which works best and why?

Set the following options:

epochs = 200
type = mlp
layers = {'l1':1024, 'l2':64} 
save = True
epoch_progression = True
best_model = True

Run the experiment.
Find the epoch progression plot and see at which epoch overfitting starts.
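
One common way to read such a plot is to look for the epoch at which the loss on the dev data is lowest: after that point the training loss keeps falling while the dev loss rises again. A minimal Python sketch of this reading, with made-up loss values and a hypothetical function name, could look like this:

```python
def best_dev_epoch(dev_losses):
    """Return the 1-based epoch with the lowest dev loss."""
    best_index = min(range(len(dev_losses)), key=lambda i: dev_losses[i])
    return best_index + 1


# Hypothetical loss curves, one value per epoch
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
dev_losses = [1.05, 0.80, 0.65, 0.60, 0.62, 0.68, 0.75]

turning_point = best_dev_epoch(dev_losses)
print(f"dev loss is lowest at epoch {turning_point}; later epochs start to overfit")
```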