How to limit a dataset with Nkululeko

In some cases you don't want to use the whole dataset for training or test, but filter it in some way. There are several possibilities demonstrated:
Some are valid per database:

databases = ['d1']
# force a specific feature to be present, e.g. gender labels ( when not all data has gender values)
d1.required = gender

# limit the number of samples per speaker
d1.max_samples_per_speaker = 20

# only use the first 10000 samples
d1.limit = 10000

Others are valid for the whole experiment, i.e. all databases

# specify a minimum duration for test samples (in seconds)
min_dur_test = 3.5

# use only samples where gender is female
sex = female

Specifying database disk location with Nkululeko

Since version 0.13.0 with Nkululeko you can define all root folders for your databases at one single place.
This is very handy if you work in paralell on several computers, e.g. a development and a deployment environment.

In the [DATA] section of your ini file, you specify the path to the local data root folder file like this:

root_folders = data_roots.ini
databases = ['dataset_1']
dataset_1.files_tables = ['files']

and then within the data_roots.ini file (you can actually call it what you want), you declare the folders to your databases like this:

dataset_1 = /mypath/d1/
dataset_2 = ./d2

you can add all your dataset options that you need in this file (actually you have to do this, currently a mixture is not supported):

emodb = /mypath/d1/
emodb.split_strategy = speaker_split
emodb.testsplit = 40
dataset_2 = ./d2
dataset_2.files_tables = ['files_test', 'files_train']

Kinds of machine learning

This post is an attempt to sort out some terms that are used around the topic of machine learning.


meaning "artificial intelligence" is a term that the computer scientist John McCarthy used at the Dartmouth Conference 1956. It's really just a term to make the field sound more interesting. Until today all, so-called AI-systems are simply based on pattern recognition by statistics and I wouldn't know of a good model for human intelligence, or even a definition.

Soft/weak vs. strong AI

These are terms that are often used without a clear definition and they come from different traditions: 1) a philosophical one, meaning the difference between replicating the system vs. the signal, i.e. the allegory of the Chinese chamber; asking if someone in China not knowing Chinese who can answer Chinese questions by looking them up in a dictionary, is intelligent? 2) the difference between symbolic AI that works on intelligence models with expert knowledge vs. stochastic AI that uses data to detect underlying problem solving strategies and 3) what is usually meant in the current discussion with these terms is the distinction between a general AI that learns underlying principles to solve a number of problems, some of them even yet unknown, vs. a specialized AI that is focused on one problem, e.g. playing chess or driving a car.

Deep learning

is a fuzzy expression connected with artificial neural nets. What mostly is meant, is that the number of hidden layers (all layers apart from in- and output layer) is rather deep, but it remains unclear how many layers are needed. The more layers, the harder it is to handle the vanishing gradient problem, i.e. that the early layers don't get updated any more during training because the numbers become too small. Another interpretation (and one that makes more sense in my opinion) is, that deep learning refers to the raising level of abstraction of the hidden layers from raw input to abstract labels (e.g. picture pixels to animal names), especially with CNNs. For example, in the early layers mainly edges and contours are represented, in the later layers complex objects like beaks or eyes.

Hyperparameter learning/tuning

An artificial neural net has two kinds of parameters: the weights and biases that are learned during training, and the so-called hyper or meta parameters that do not changes during a training process, like for example the net architecture (number of layers / neurons per layer), the learning rate or other algorithmic constants. As they influence the performance they need to be learned as well and that's what the development aka evaluation split is for: i.e. a part of the data that is not used for training nor for the final test, but to evaluate the current hyperparameters. The easiest approach is a so called grid search, i.e. try all different combinations. But because the number of combinations grows exponentially with the number of tuned hyperparameters, a stochastic random based search or a learning algorithm is much more sensible.

Supervised vs. unsupervised

means the distinction whether your training data is annotated (or labeled) with respect to your task. An example: If you want to build a machine learner for human age estimation based on speech, you might give an algorithm a lot of examples of human speech annotated with the age of the person. This would be your training data and the approach would be supervised (by the age annotations). With unsupervised learning, you would give an algorithm simply a lot of human speech data and might ask it to cluster the data, based on differences. And might hope that the resulting clusters coincide with age.

Semi-supervised learning

means to use so-called soft labels for training, i.e. annotations that were generated by a machine learning predictor. If you have vast amounts of data but only a part of them is annotated, you might try to train a machine learner supervised with the annotated data and then use the resulting model to predict the rest. A variant would be to use the seed model to search for interesting data to be annotated, as for example rare events in you data.

Self-supervised learning

means techniques to prepare an artificial neural net by learning something on the data without predicting a concrete (supervised) feature/label/annotation. This can be done for example by masking parts of the data and training the net on predicting the masked parts as is done with the transformer technique. A different approach would be to use a triplet loss by training the net to distinguish near-by from far-away data. Once pretrained, a self-supervised net can be used for various down-stream tasks, like for example classification or regression of labeled data.

Reinforcement learning

is a fundamentally different kind of machine learning, usually metaphored with "learning like a child". Although mostly it is, it actually does not necessarily require to be implemented with artificial neural nets. The main idea is that an actor receives sensations from an environment that are interpreted and lead to new actions based on an evaluation criterion. As opposed to loss functions with neural learning, the evaluation criterion is rather abstract, like "reach the wall" (for a robot that should learn how to walk) or "win the game" for a chess player. It is rather tricky to translate evaluation criteria to concrete loss functions (needed to train a machine learner), and reinforcement learning requires vast amounts of data that usually are generated by simulations. That's the reason why, although very charming as an idea, reinforcement learning is successful mainly in gaming applications.

Representation learning

is learning to distinguish the essence of the data at hand from noise factors (that might come from the recording of the data). For example with speech it is mainly not interesting which microphones recorded the speakers and how the room acoustics was. All this is present in the data, but usually not important for the task at hand and, in fact, one of the reasons for over fitting (learning the training data but not the task) and lack of generalization (being able to recognize out-of-domain data: from different sources). As modeling data in machine learning is always a dimension reduction, representation learning searches for the dimensions that best represent the interesting aspects of the data. Self-supervised learning is a kind of representation learning.

Transfer learning

means to transfer knowledge from one domain to another. There are many ways to do this, for example you may pretrain your model with data from one domain and then finetune it with the data that represents your application. Self-supervised learning is also a kind of transfer learning. Another approach is multi task learning, where one large artificial neural net is trained for several tasks in parallel, with a so-called multi-head architecture (meaning the last layers are separated for each task).
The main idea is that you can use large quantities of data that are related to your task.

Pretraining and finetuning

means that you pretrain an artificial neural net with some large data sets that are related to you task, or at least have the same modality. You then remove the last layers of you net and add at least an output layer for your task. An example: wav2vec2.0 is a model trained on many hundreds of hours of speech data with ASR (automatic speech recognition) as a main target, but can be used as embeddings to classify emotional expression in speech.


means that the machine learner performs well on the training but not on any other data. This is usually the case when the model has enough complexity to distinguish all training data and is trained for enough periods (one period is one run through the training). Measures against this are subsumed under the label regularization.

Loss function

is the function that artificial neural nets use to track progress, i.e. the function that evaluates the predicted outcome with the desired one. Finding a good loss function is crucial for your task.


are learned representations of data, usually the pen-ultimate layer of a pretrained artificial neural net.

Latent space

means the property of deep artificial neural nets to represent specific features of the data within the higher layers, for example speaker characteristics or expressed emotion in a net trained for speech synthesis. This is often used to influence the output in a desired way, for example simulating a specific speaking style.

Zero shot learning

means to be able to predict classes that the machine learner has never seen at training time by using some auxiliary information, for example textual.

Adversarial learning

means generally the attempt to corrupt a model after it has been deployed in order to achieve some unexpected (from the developers and normal users) behavior. Examples would be to trick models into false classification by disguising the input in some form, re-engineering the model by learning from in- and output behavior or influencing the training data to harm the model.

Active learning

means that the machine learner itself is asking actively a so-called teacher (or oracle), often a human labeler, to annotate samples it is unsure of.

Bias vs. variance

means the trade-off between generalization (high bias, underfitting) and specification (high variance, overfitting). You can either
a) have simple models, like e.g. linear regression classifiers, that will treat every input with a similar strong bias (wrong decisions), irrespective of the training set, or
b) very complex models (e.g. a neural net with many layers) that will be more exact but very specific to your training data

Curriculum learning

The technique of Curriculum learning is again inspired by human learning (like reinforcement learning), by copying the strategy to get better on a problem or ability to first look at clear and easily separatable samples and then progressivly the more difficult ones. This might prevent overfitting and definitely leads to models with much higher initial performance.

Contrastive learning

Contrastive learning is a kind of unsupervised learning by simply looking at different data items from a set and check the difference between them by contrasting similar from different items. This can be used for example by the triplet loss function where a data sample gets compared to one that is considered near-by (e.g. coming from the same context) and a third one that is considered to be far-away (e.g. a different context). The model is then trained without labels (apart from the context, which can usually be derived without labels) to distinguish between the positive and negative samples.


Perceptron is the original name that Minsky and Papert 1969 gave to the concept to model learning as a linear equation filtered by a non-linear function, inspired by the human neuron cell that fires only if a certain electric potential has been reached.

MLP/FFN- Multilayer Perceptron / -Feedforward Neural Network

Many perceptrons organized in layers of neurons, transforming information from input to output (one direction) while during training stage updating the weights (of the neurons/perceptrons) by the so.called backpropagation algorithm.
These are so-called vanilla networks because very simple, but still being used often as the last layers of a network to actually deliver a result.

RNN - Recurrent Neural Network

CNN - Convlutional Neural Netwrk

LSTM - Long-Shortterm Neural Network

ResNet - Residual Neural Network


I'd like to reveal some of my sources, much indebted: