Import speech data to Nkululeko

Often you simply start an experiment with some audio data that you got from somewhere, in no special format, and with the labels encoded in the filenames.
If so, this Python script can help to convert the audio to a Nkululeko-readable format and to generate a CSV (comma separated values) file.

import os
from audeer import list_file_names
from os.path import basename

# folder with the original audio files (in wav format)
root = './orig_wav/'
# output folder, empty at the beginning
out_dir = './audio/'
# name of the output file list
out_file = 'data.csv'

# make sure the output folder exists
os.makedirs(out_dir, exist_ok=True)

# get a list of wav files
files = list_file_names(root, filetype='wav', basenames=True, recursive=True)
# write the list header (change to your data)
with open(out_file, 'w') as the_file:
    the_file.write('file,type\n')
# for each file
for file in files:
    # get the file name without path
    fn = basename(file)
    # convert to 16kHz sampling rate and mono channel 
    os.system(f'sox {root+file} -r 16000 -c 1 {out_dir+fn}')
    # extract the annotation label from the file name (change this to your needs)
    label = fn[0]
    # lastly: add file to list 
    with open(out_file, 'a') as the_file:
        the_file.write(f'{out_dir+fn},{label}\n')

The resulting data list can then be read by Nkululeko in the config file (using a random 30% of the data as development set):

[DATA]
my_data = /some_path/data.csv
my_data.type = csv
my_data.split_strategy = random
my_data.testsplit = 30

How to limit a dataset with Nkululeko

In some cases you don't want to use the whole dataset for training or test, but filter it in some way. There are several filter possibilities in Nkululeko:

  • limit_samples: limit the number of samples, randomly selected
  • limit_samples_per_speaker: maximum number of samples per speaker (for leveling data where same speakers have a large number of samples)
  • min_duration_of_sample: limit the samples to a minimum length (in seconds)
  • max_duration_of_sample: limit the samples to a maximum length (in seconds)
  • filter: don't use all the data but only samples with selected values in given columns: [col, val].
    You can specify several filters at once, e.g.

    [DATA]
    filter = [['sex', 'female'], ['style', 'reading']]

    would use only the data where sex is female and style is reading.

These can be specified per database:

[DATA]
databases = ['d1']
# force a specific feature to be present, e.g. gender labels ( when not all data has gender values)
d1.required = gender
# limit the absolute sample number
d1.limit_samples = 500
# limit the number of samples per speaker
d1.limit_samples_per_speaker = 20

Or for all samples, or only for the test and/or train splits:

[DATA]
# only filter the training split: 
filter.sample_selection = train
# specify a minimum duration for train samples (in seconds)
min_duration_of_sample = 3.5
# use only samples where gender is female
filter = [['gender', 'female']]

Specifying database disk location with Nkululeko

Since version 0.13.0 of Nkululeko, you can define all root folders for your databases in one single place.
This is very handy if you work in parallel on several computers, e.g. a development and a deployment environment.

In the [DATA] section of your ini file, you specify the path to the file with the local data root folders like this:

[DATA]
root_folders = data_roots.ini
databases = ['dataset_1']
...

and then within the data_roots.ini file (you can actually call it what you want), you declare the paths to your databases like this:

[DATA]
dataset_1 = /mypath/d1/
dataset_1.files_tables = ['files']
dataset_2 = ./d2
...

You can add all the dataset options that you need in this file:

[DATA]
emodb = /mypath/d1/
emodb.split_strategy = speaker_split
emodb.testsplit = 40
emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'fear':'fright.', 'neutral':'neutral'}
dataset_2 = ./d2
dataset_2.files_tables = ['files_test', 'files_train']

If you also define those fields in your experiment ini file, they will take precedence.

Kinds of machine learning

This post is an attempt to sort out some terms that are used around the topic of machine learning.

AI

meaning "artificial intelligence" is a term that the computer scientist John McCarthy used at the Dartmouth Conference 1956. It's really just a term to make the field sound more interesting. Until today all, so-called AI-systems are simply based on pattern recognition by statistics and I wouldn't know of a good model for human intelligence, or even a definition.

Soft/weak vs. strong AI

These are terms that are often used without a clear definition, and they come from different traditions:
1) a philosophical one, meaning the difference between replicating the system vs. the signal, i.e. the Chinese room argument: is someone who does not know Chinese, but can answer Chinese questions by looking them up in a dictionary, intelligent?
2) the difference between symbolic AI, which works on intelligence models with expert knowledge, vs. stochastic AI, which uses data to detect underlying problem solving strategies, and
3) what is usually meant in the current discussion: the distinction between a general AI that learns underlying principles to solve a number of problems, some of them even yet unknown, vs. a specialized AI that is focused on one problem, e.g. playing chess or driving a car.

Deep learning

is a fuzzy expression connected with artificial neural nets. What is mostly meant is that the number of hidden layers (all layers apart from the in- and output layers) is rather large, but it remains unclear how many layers are needed. The more layers, the harder it is to handle the vanishing gradient problem, i.e. that the early layers don't get updated any more during training because the numbers become too small. Another interpretation (and one that makes more sense in my opinion) is that deep learning refers to the rising level of abstraction of the hidden layers from raw input to abstract labels (e.g. picture pixels to animal names), especially with CNNs. For example, the early layers mainly represent edges and contours, the later layers complex objects like beaks or eyes.

Classification vs. Regression

Means the difference whether you want to predict one class/category out of a limited set of possibilities (classification) or a real value (regression). Regression problems can be converted to classification by binning, and classification to regression (in case the classes can be ordered by some criterion) by interpolation.
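
As a small illustration of both directions (not Nkululeko code, just a sketch assuming numpy; the ages and bin midpoints are made up):

import numpy as np

# hypothetical continuous targets: speaker ages in years
ages = np.array([23, 47, 61, 35, 78])

# regression -> classification: bin into 'young' (<= 50) and 'old' (> 50)
classes = np.where(ages > 50, 'old', 'young')

# classification -> regression: map the ordered classes back to values,
# here simply via the midpoint of each bin (a crude interpolation)
midpoints = {'young': 25.0, 'old': 65.0}
estimates = np.array([midpoints[c] for c in classes])

print(classes)    # ['young' 'young' 'old' 'young' 'old']
print(estimates)  # [25. 25. 65. 25. 65.]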

Hyperparameter learning/tuning

An artificial neural net has two kinds of parameters: the weights and biases that are learned during training, and the so-called hyper- or meta-parameters that do not change during a training process, like for example the net architecture (number of layers / neurons per layer), the learning rate or other algorithmic constants. As they influence the performance, they need to be learned as well, and that's what the development (aka evaluation) split is for: a part of the data that is used neither for training nor for the final test, but to evaluate the current hyperparameters. The easiest approach is a so-called grid search, i.e. trying all different combinations. But because the number of combinations grows exponentially with the number of tuned hyperparameters, a stochastic random search or a learning algorithm is much more sensible.
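
As a sketch (assuming scikit-learn, which is my choice for illustration and not prescribed by the text), grid search and random search over two SVM hyperparameters could look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# grid search: try every combination of the listed hyperparameter values
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)

# random search: only sample a fixed number of combinations
param_dist = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
rand = RandomizedSearchCV(SVC(), param_dist, n_iter=5, cv=3)
rand.fit(X, y)
print(rand.best_params_)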

Supervised vs. unsupervised

means the distinction whether your training data is annotated (or labeled) with respect to your task. An example: if you want to build a machine learner for human age estimation based on speech, you might give an algorithm a lot of examples of human speech annotated with the age of the person. This would be your training data, and the approach would be supervised (by the age annotations). With unsupervised learning, you would simply give an algorithm a lot of human speech data and might ask it to cluster the data based on differences, hoping that the resulting clusters coincide with age.
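
To make the contrast concrete, here is a sketch (assuming scikit-learn; the iris data just stands in for any labeled set):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# supervised: the labels y guide the training
classifier = SVC().fit(X, y)

# unsupervised: only the data X is given, the algorithm forms clusters on its own
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)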

Semi-supervised learning

means to use so-called soft labels for training, i.e. annotations that were generated by a machine learning predictor. If you have vast amounts of data but only a part of them is annotated, you might try to train a machine learner supervised with the annotated data and then use the resulting model to predict the rest. A variant would be to use the seed model to search for interesting data to be annotated, for example rare events in your data.

Self-supervised learning

means techniques to prepare an artificial neural net by learning something on the data without predicting a concrete (supervised) feature/label/annotation. This can be done for example by masking parts of the data and training the net on predicting the masked parts as is done with the transformer technique. A different approach would be to use a triplet loss by training the net to distinguish near-by from far-away data. Once pretrained, a self-supervised net can be used for various down-stream tasks, like for example classification or regression of labeled data.

Reinforcement learning

is a fundamentally different kind of machine learning, usually described with the metaphor of "learning like a child". Although it mostly is, it does not necessarily have to be implemented with artificial neural nets. The main idea is that an actor receives sensations from an environment that are interpreted and lead to new actions based on an evaluation criterion. As opposed to loss functions in neural learning, the evaluation criterion is rather abstract, like "reach the wall" (for a robot that should learn how to walk) or "win the game" for a chess player. It is rather tricky to translate evaluation criteria to concrete loss functions (needed to train a machine learner), and reinforcement learning requires vast amounts of data that usually are generated by simulations. That's the reason why, although very charming as an idea, reinforcement learning is successful mainly in gaming applications.

Representation learning

is learning to distinguish the essence of the data at hand from noise factors (that might come from the recording of the data). For example, with speech it is mostly not interesting which microphone recorded the speakers and what the room acoustics were like. All this is present in the data, but usually not important for the task at hand and, in fact, one of the reasons for overfitting (learning the training data but not the task) and lack of generalization (being able to recognize out-of-domain data, i.e. from different sources). As modeling data in machine learning is always a dimension reduction, representation learning searches for the dimensions that best represent the interesting aspects of the data. Self-supervised learning is a kind of representation learning.

Transfer learning

means to transfer knowledge from one domain to another. There are many ways to do this, for example you may pretrain your model with data from one domain and then finetune it with the data that represents your application. Self-supervised learning is also a kind of transfer learning. Another approach is multi-task learning, where one large artificial neural net is trained for several tasks in parallel, with a so-called multi-head architecture (meaning the last layers are separate for each task).
The main idea is that you can use large quantities of data that are related to your task.
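
A minimal sketch of such a multi-head architecture (assuming PyTorch; the feature size and the two tasks are made up for illustration):

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    # one shared trunk, one separate output head per task
    def __init__(self, n_features, n_emotions, n_age_bins):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(64, n_emotions)  # task 1
        self.age_head = nn.Linear(64, n_age_bins)      # task 2

    def forward(self, x):
        shared = self.trunk(x)
        return self.emotion_head(shared), self.age_head(shared)

net = MultiTaskNet(n_features=88, n_emotions=4, n_age_bins=5)
emotion_logits, age_logits = net(torch.randn(8, 88))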

Pretraining and finetuning

means that you pretrain an artificial neural net with some large data sets that are related to your task, or at least have the same modality. You then remove the last layers of your net and add at least an output layer for your task. An example: wav2vec2.0 is a model trained on many hundreds of hours of speech data with ASR (automatic speech recognition) as a main target, but it can be used as embeddings to classify emotional expression in speech.
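
As a sketch of using such a pretrained model as an embedding extractor (assuming the Hugging Face transformers library and the facebook/wav2vec2-base-960h checkpoint; random noise stands in for a real utterance):

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h')
model = Wav2Vec2Model.from_pretrained('facebook/wav2vec2-base-960h')

signal = np.random.randn(16000)  # stand-in for one second of 16 kHz speech
inputs = extractor(signal, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, frames, 768)
embedding = hidden.mean(dim=1)  # pooled utterance-level embedding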

One/few shot learning

means learning classes that have very few examples in the training data, by deriving information from related classes with more samples.

Zero shot learning

means to be able to predict classes that the machine learner has never seen at training time by using some auxiliary information, for example textual.

Adversarial learning

means generally the attempt to corrupt a model after it has been deployed in order to achieve some behavior unexpected by the developers and normal users. Examples would be tricking models into false classifications by disguising the input in some form, re-engineering the model by learning from its in- and output behavior, or influencing the training data to harm the model.

Active learning

means that the machine learner itself actively asks a so-called teacher (or oracle), often a human labeler, to annotate samples it is unsure of.

Curriculum learning

The technique of curriculum learning is again inspired by human learning (like reinforcement learning), by copying the strategy to get better at a problem or ability by first looking at clear and easily separable samples and then progressively at the more difficult ones. This might prevent overfitting and definitely leads to models with much higher initial performance.

Contrastive learning

Contrastive learning is a kind of unsupervised learning that simply looks at different data items from a set and checks the difference between them, contrasting similar with different items. This can be used for example with the triplet loss function, where a data sample gets compared to one that is considered near-by (e.g. coming from the same context) and a third one that is considered far-away (e.g. a different context). The model is then trained without labels (apart from the context, which can usually be derived without labels) to distinguish between the positive and negative samples.
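
A sketch of the triplet loss idea (assuming PyTorch; the small embedding net and the random inputs are stand-ins):

import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = embed(torch.randn(16, 40))  # a batch of samples
positive = embed(torch.randn(16, 40))  # samples from the same context
negative = embed(torch.randn(16, 40))  # samples from a different context

# the loss pushes anchor and positive together, anchor and negative apart
loss = triplet_loss(anchor, positive, negative)
loss.backward()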

Federated / collaborative learning

means to distribute model training across a multitude of devices and/or servers, mainly to preserve privacy by not sharing the data but processing it on device and sending model updates.

Ensemble learning / Meta learning

means to use several machine learners and fuse their decisions later, either by rule or learned. Some boosting techniques are based on this idea already in the algorithm.

Continuous / life long learning

means ANN architectures that prevent overwriting the weights (aka "forgetting") when new training data comes in. This is important when one task is learned continuously on data coming from different domains, e.g. a voice diary that shall learn emotion recognition throughout the day.

Disentangled representation learning

is an unsupervised method based on the idea to learn different aspects of the data with respect to their level of abstraction (instead of simply representing each data item as a point in some space), by adding to the "raw" features some that are interpretable and independent from each other, e.g. speech rate and mean tone. This enhances interpretability/explainability and robustness.

Foundation model

A model that is trained, usually unsupervised, on very large quantities of data. The penultimate layer can then be used for so-called down-stream tasks, for example as automatically learned features.

Terminology

Loss function

is the function that artificial neural nets use to track progress, i.e. the function that compares the predicted outcome with the desired one. Finding a good loss function is crucial for your task.
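
Two common examples, sketched with PyTorch (my choice for illustration, not prescribed by the text):

import torch
import torch.nn as nn

# classification: cross entropy between predicted logits and target classes
ce = nn.CrossEntropyLoss()
logits = torch.randn(4, 3)            # 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # true class indices
print(ce(logits, targets))

# regression: mean squared error between predicted and true values
mse = nn.MSELoss()
print(mse(torch.randn(4), torch.randn(4)))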

Backpropagation

Fundamental way to train neural networks by evaluating the error with the loss function and then propagating it backwards towards the input layer, by taking the derivative.

Batch size

Number of samples in one batch during training, which are used together to compute the error (-> loss function) and do one backpropagation step.

Embeddings

are learned representations of data, usually the penultimate layer of a pretrained artificial neural net.

Latent space

means the property of deep artificial neural nets to represent specific features of the data within the higher layers, for example speaker characteristics or expressed emotion in a net trained for speech synthesis. This is often used to influence the output in a desired way, for example simulating a specific speaking style.

Freezing

layers in an ANN means to not update the weights, as they might contain knowledge that should not be forgotten (from a pretrained net) or to make the training faster.
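
A short sketch (assuming PyTorch; the two-layer net is made up):

import torch.nn as nn

net = nn.Sequential(nn.Linear(88, 64), nn.ReLU(), nn.Linear(64, 4))

# freeze the first (e.g. pretrained) layer: its weights are excluded from updates
for param in net[0].parameters():
    param.requires_grad = False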

Drop out

is the technique to delete a number of randomly selected neurons in a hidden layer during training to prevent overfitting.
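
For example (again a PyTorch sketch, not from the original post):

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(88, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # randomly zero 30% of the activations during training
    nn.Linear(64, 4),
)
# net.train() enables dropout, net.eval() disables it for inference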

Patience

Number of epochs with no improvement after which training will be stopped.
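
A self-contained sketch of the idea (random numbers stand in for the real per-epoch validation loss):

import random

patience = 5  # stop after 5 epochs without improvement
best_loss = float('inf')
wait = 0

for epoch in range(100):
    val_loss = random.random()  # stand-in for the validation loss of this epoch
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            print(f'early stopping at epoch {epoch}')
            break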

Overfitting

means that the machine learner performs well on the training data but not on any other data. This is usually the case when the model has enough complexity to distinguish all training data and is trained for enough epochs (one epoch is one run through the training data). Measures against this are subsumed under the label regularization.

Vanishing / exploding gradient

means that the weights of the neurons become too small or too large for the net to be stable. This happens especially with very deep (many layers) networks.

Bias vs. variance

means the trade-off between generalization (high bias, underfitting) and specialization (high variance, overfitting). You can either
a) have simple models, like e.g. linear regression classifiers, that will treat every input with a similarly strong bias (wrong decisions), irrespective of the training set, or
b) very complex models (e.g. a neural net with many layers) that will be more exact but very specific to your training data.
Here's a nice visualization of bias vs. variance.

ANN architectures

Perceptron

Perceptron is the name that Frank Rosenblatt gave in 1958 to the concept of modeling learning as a linear equation filtered by a non-linear function, inspired by the human neuron cell that fires only if a certain electric potential has been reached. (Minsky and Papert's 1969 book "Perceptrons" later analyzed the limits of this model.)

MLP / FFN - Multilayer Perceptron / Feedforward Neural Network

Many perceptrons organized in layers of neurons, transforming information from input to output (one direction) while, during the training stage, updating the weights (of the neurons/perceptrons) by the so-called backpropagation algorithm.
These are so-called vanilla networks because they are very simple (and the first ones that were developed), but they are still often used as the last layers of a network to actually deliver a result. The main problem is the very large number of connections (and thus weights) and the vanishing gradient if they have many layers.

RNN - Recurrent Neural Network

is an ANN architecture where cells can have their own output as input, which means that the ANNs get a time dimension that does not exist with the older Feed Forward nets.

CNN - Convolutional Neural Network

are ANNs that reduce the number of connections between cells by introducing filters/patches that can be reused across the input field. This is inspired by techniques from image analysis. One of the big advantages (apart from reducing the number of weights) is that the layers are in part interpretable, as they become more and more high-level.

LSTM - Long Short-Term Memory Network

Is a kind of RNN with memory cells. A simpler form is known as GRU (Gated recurrent units).

ResNet - Residual Neural Network

Are very deep ANNs that avoid the vanishing gradient problem by introducing skip (residual) connections, which pass the input of a block of layers directly to its output so that the block only has to learn the residual.

GANs - Generative Adversarial Networks

Is a combination of two networks: a generator that tries to replicate samples from a training set, and a discriminator that tries to distinguish the original from the generated samples. As both get better, the likeness of the fake samples gets better.

VAE - Variational Autoencoders

Are two networks, an encoder and a decoder; the task is to restore a sample that was reduced to a lower dimension by the encoder. With variational AEs, the encoder/decoder input can be interpolated, and thus new samples can be created as mixtures of the original ones. Another use case is representation learning, as due to the dimensionality reduction step it is learned which information in the signal is relevant for the nature of the samples.

Nkululeko: How to import a database

Nkululeko is a tool to ease machine learning on speech databases.
This tutorial should help you to import databases.
There are two formats supported:
1) csv (comma separated values)
2) audformat

CSV format

The easiest is CSV: you simply create a table with the following information:

  • file: the path to the audio file
  • task: the speaker characteristics value that you want to explore, e.g. age or emotion, or both

and then fill it with values of your database. Optionally, your data can contain any amount of additional information in further columns. Some naming conventions are pre-defined:

  • speaker: speaker id, a string being unique for samples from one speaker
  • gender: biological sex
  • age: an integer between 0 and 100 denoting the age in years.

So a file for emotion might look like this

file, speaker, gender, emotion
<path to>/s12343.wav, s1, female, happy
...
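
If your metadata already lives in Python, such a table can be written with pandas, for example (a sketch; the values are made up):

import pandas as pd

df = pd.DataFrame({
    'file': ['<path to>/s12343.wav'],
    'speaker': ['s1'],
    'gender': ['female'],
    'emotion': ['happy'],
})
df.to_csv('my_data_file.csv', index=False)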

You can then specify the data in your initialization file like this:

[DATA]
databases = ['my_db']
my_db.type = csv
my_db = <path to>/my_data_file.csv
my_db.absolute_path = False 
...
target = emotion

You should set the flag absolute_path depending on whether

  • the file paths start from the location where you run Nkululeko (or start from root: /), then True
  • or they start from the location where the data resides, then False

(if in doubt, just try it out: there should be an error message that the audio files don't exist)

You cannot specify split tables with this format, but would have to simply split the file into several databases.

There is an example on how to import the ravdess database here.

And this would be an example ini file to use it:

[EXP]
root = ./tests/results/
name = exp_ravdess
runs = 1
epochs = 1
save = True
[DATA]
databases = ['train', 'test', 'dev']
train = ../nkululeko/data/ravdess/ravdess_train.csv
train.type = csv
train.absolute_path = False
train.split_strategy = train
dev = ../nkululeko/data/ravdess/ravdess_dev.csv
dev.type = csv
dev.absolute_path = False
dev.split_strategy = train
test = ../nkululeko/data/ravdess/ravdess_test.csv
test.type = csv
test.absolute_path = False
test.split_strategy = test
target = emotion
labels = ['angry', 'happy', 'neutral', 'sad']
[FEATS]
type = ['os']
scale = standard
[MODEL]
type = xgb

I.e. the splits train and dev get concatenated into a common train set.

audformat

audformat allows for many use cases, so the specification might be more complex.
In the easiest case you have a database with two tables: one called files that contains the speaker information (id and sex), and one named after your task (aka target), so for example age or emotion.
That's the case for our demo example, the Berlin EmoDB, and so you can include it simply like this:

[DATA]
databases = ['emodb']
emodb = /<path to>/emodb/
target = emotion
...

But if there are more tables and they have special names, you can specify them like this:

[DATA]
databases = ['msp']
# path to data
msp = /<path to>/msppodcast/
# tables with speaker information
msp.files_tables =  ['files.test-1', 'files.train']
# tables with task labels
msp.target_tables =  ['emotion.test-1', 'emotion.train']
# train and evaluation splits will be provided
msp.split_strategy = specified
# here are the test/evaluation split tables
msp.test_tables = ['emotion.test-1']
# here are the training tables
msp.train_tables = ['emotion.train']
target = emotion

Nkululeko: classifying continuous variables

Nkululeko supports classification and regression.
Classification means predicting a class (or category) from data, regression predicting a continuous value, as for example the speaker age in years.

If you want to use classification with continuous variables, you need to bin them first, which means that you put the values into pre-defined bins. To stay with our age example, you'd declare everyone above 50 years as old and all others as young.

This post shows you how to do that with Nkululeko by setting up your .ini file.

You set up the experiment as classification type:

[EXP]
...
type = classification

But declare the data to be continuous:

[DATA]
...
type = continuous
labels = ['u40', '40ies', '50ies', '60ies', 'ü70']
bins  = [-1000,  40, 50, 60, 70, 1000]

Then the data will be binned according to the specified bins and labeled accordingly.
You need (number of labels) + 1 values for the bins, as they define the lower and upper limit of each bin. It makes sense to set the lowest and highest limits to extreme values, as you don't know what the classifier will predict.
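
Roughly, the binning corresponds to something like this (a sketch with pandas for illustration, not Nkululeko's actual code), using five labels and six bin edges:

import pandas as pd

labels = ['u40', '40ies', '50ies', '60ies', 'ü70']
bins = [-1000, 40, 50, 60, 70, 1000]

ages = pd.Series([25, 43, 57, 66, 81])
binned = pd.cut(ages, bins=bins, labels=labels)
print(binned.tolist())  # ['u40', '40ies', '50ies', '60ies', 'ü70']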

How to soft-label a database with Nkululeko

Soft-labeling means to annotate data with labels that were predicted by a machine classifier.
As they were not evaluated by a human, you might call them "soft".

Two steps are necessary:
1) save a test/evaluation set as a new database
2) load this new database in a new experiment as training data

Within nkululeko, you would do it like this:

step 1: save a new database

You simply specify a file name in the EXP section to save the test predictions, like this:

[EXP]
...
save_test = ./my_test_predictions.csv

You need model saving to be turned on, because Nkululeko will look for the best performing model:

[MODEL]
...
save = True

This will store the new database to a file called my_test_predictions.csv in the folder from which Python was called.

step 2: load as training data

Here is an example configuration of how to load this data as additional training data (in this case in addition to emodb):

[EXP]
root = ./tests/
name = exp_add_learned
runs = 1
epochs = 1
[DATA]
strategy = cross_data
databases = ['emodb', 'learned', 'test_db']
trains = ['emodb', 'learned']
tests = ['test_db']
emodb = /path-to-emodb/
emodb.split_strategy = speaker_split
emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
test_db = /path to test database/
test_db.mapping = <if any mapping to the target categories is needed>
learned = ./my_test_predictions.csv
learned.type = csv
target = emotion
labels = ['angry', 'happy', 'neutral', 'sad']
[FEATS]
type = os
[MODEL]
type = xgb
save = True
[PLOT]