How to limit a dataset with Nkululeko

In some cases you don't want to use the whole dataset for training or testing, but filter it in some way. Several possibilities are demonstrated below.
Some options are valid per database:

[DATA]
databases = ['d1']
...
# force a specific field to be present, e.g. gender labels (when not all data has gender values)
d1.required = gender

# limit the number of samples per speaker
d1.max_samples_per_speaker = 20

# only use the first 10000 samples
d1.limit = 10000

Other options are valid for the whole experiment, i.e. for all databases:

[DATA]
# specify a minimum duration for test samples (in seconds)
min_dur_test = 3.5

# use only samples where gender is female
sex = female

Specifying database disk location with Nkululeko

Since version 0.13.0, Nkululeko lets you define all root folders for your databases in one single place.
This is very handy if you work in parallel on several computers, e.g. in a development and a deployment environment.

In the [DATA] section of your ini file, you specify the path to the file with the local data root folders like this:

[DATA]
root_folders = data_roots.ini
databases = ['dataset_1']
dataset_1.files_tables = ['files']
...

and then, within the data_roots.ini file (you can actually name it whatever you want), you declare the root folders of your databases like this:

[DATA]
dataset_1 = /mypath/d1/
dataset_2 = ./d2
...

You can add all the dataset options you need in this file (in fact you have to, as a mixture of both files is currently not supported):

[DATA]
emodb = /mypath/d1/
emodb.split_strategy = speaker_split
emodb.testsplit = 40
dataset_2 = ./d2
dataset_2.files_tables = ['files_test', 'files_train']

Kinds of machine learning

This post is an attempt to sort out some terms that are used around the topic of machine learning.

AI

meaning "artificial intelligence" is a term that the computer scientist John McCarthy used at the Dartmouth Conference 1956. It's really just a term to make the field sound more interesting. Until today all, so-called AI-systems are simply based on pattern recognition by statistics and I wouldn't know of a good model for human intelligence, or even a definition.

Deep learning

is a fuzzy expression connected with artificial neural nets. What is mostly meant is that the number of hidden layers (all layers apart from the in- and output layers) is rather large, but it remains unclear how many layers are needed to count as "deep". The more layers, the harder it is to handle the vanishing gradient problem, i.e. that the early layers don't get updated any more during training because the gradients become too small. Another interpretation (and one that makes more sense in my opinion) is that deep learning refers to the rising level of abstraction in the hidden layers, from raw input to abstract labels (e.g. from picture pixels to animal names), especially with CNNs. For example, the early layers mainly represent edges and contours, the later layers complex objects like beaks or eyes.

Supervised vs. unsupervised

refers to the distinction of whether your training data is annotated (or labeled) with respect to your task. An example: if you want to build a machine learner for human age estimation based on speech, you might give an algorithm a lot of examples of human speech annotated with the age of the speaker. This would be your training data, and the approach would be supervised (by the age annotations). With unsupervised learning, you would simply give an algorithm a lot of human speech data and ask it to cluster the data based on differences, hoping that the resulting clusters coincide with age.
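
As a minimal sketch of the difference (not from the original post; the feature vectors and age labels are made up, and scikit-learn is just one possible toolkit):

import numpy as np
from sklearn.svm import SVR
from sklearn.cluster import KMeans

X = np.random.rand(100, 20)          # hypothetical acoustic features, one row per sample
y = np.random.randint(20, 80, 100)   # hypothetical age annotations

# supervised: learn a mapping from the features to the annotated ages
age_model = SVR().fit(X, y)
predicted_ages = age_model.predict(X)

# unsupervised: no labels, just group the samples and hope the clusters
# correspond to something meaningful, e.g. age groups
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)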

Semi-supervised learning

means to use so-called soft labels for training, i.e. annotations that were generated by a machine learning predictor. If you have vast amounts of data but only a part of it is annotated, you might train a machine learner in a supervised way on the annotated part and then use the resulting model to predict labels for the rest. A variant would be to use the seed model to search for interesting data to be annotated, for example rare events in your data.
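
A hedged sketch of this pseudo-labelling idea (all data here is made up, and scikit-learn is just an example toolkit):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_labeled = np.random.rand(200, 20)        # annotated part of the data
y_labeled = np.random.randint(0, 2, 200)   # the available annotations
X_unlabeled = np.random.rand(2000, 20)     # the much larger unannotated part

# 1) train a seed model on the annotated part (supervised)
seed_model = RandomForestClassifier().fit(X_labeled, y_labeled)

# 2) let the seed model predict "soft labels" for the unannotated part
soft_labels = seed_model.predict(X_unlabeled)

# 3) retrain on the union of real and soft labels
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, soft_labels])
final_model = RandomForestClassifier().fit(X_all, y_all)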

Self-supervised learning

means techniques to prepare an artificial neural net by learning something from the data without predicting a concrete (supervised) feature/label/annotation. This can be done, for example, by masking parts of the data and training the net to predict the masked parts, as is done with transformer models. A different approach is to use a triplet loss, training the net to distinguish near-by from far-away data. Once pretrained, a self-supervised net can be used for various downstream tasks, for example classification or regression on labeled data.
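
To illustrate the triplet-loss variant, here is a minimal PyTorch sketch (the encoder architecture and the data are invented; the point is only that no human labels are involved):

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 64))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = encoder(torch.randn(8, 40))  # e.g. speech frames
positive = encoder(torch.randn(8, 40))  # frames from the same recordings ("near-by")
negative = encoder(torch.randn(8, 40))  # frames from different recordings ("far-away")

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # pushes the encoder to map similar data close together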

Reinforcement learning

is a fundamentally different kind of machine learning, often described with the metaphor of "learning like a child". Although it mostly is, it does not necessarily have to be implemented with artificial neural nets. The main idea is that an actor receives sensations from an environment, which are interpreted and lead to new actions based on an evaluation criterion. As opposed to the loss functions of neural learning, the evaluation criterion is rather abstract, like "reach the wall" (for a robot that should learn how to walk) or "win the game" for a chess player. It is rather tricky to translate such evaluation criteria into concrete loss functions (needed to train a machine learner), and reinforcement learning requires vast amounts of data that usually have to be generated by simulations. That's the reason why, although very charming as an idea, reinforcement learning has been successful mainly in gaming applications.

Representation learning

is learning to distinguish the essence of the data at hand from noise factors (that might, for example, come from the recording of the data). With speech, for instance, it is usually not interesting which microphones recorded the speakers or what the room acoustics were like. All of this is present in the data, but usually not important for the task at hand, and it is in fact one of the reasons for overfitting (learning the training data but not the task) and lack of generalization (not being able to recognize out-of-domain data, i.e. data from different sources). As modeling data in machine learning is always a dimension reduction, representation learning searches for the dimensions that best represent the interesting aspects of the data. Self-supervised learning is a kind of representation learning.

Transfer learning

means to transfer knowledge from one domain to another. There are many ways to do this; for example, you may pretrain your model with data from one domain and then finetune it with the data that represents your application. Self-supervised learning is also a kind of transfer learning. Another approach is multi-task learning, where one large artificial neural net is trained for several tasks in parallel with a so-called multi-head architecture (meaning the last layers are separate for each task).
The main idea is that you can leverage large quantities of data that are related to your task.
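
A minimal sketch of such a multi-head architecture in PyTorch (the tasks, feature sizes and layer sizes are made up):

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features=40):
        super().__init__()
        # shared trunk, trained with the data of all tasks
        self.trunk = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())
        # one head per task, each with its own last layer
        self.age_head = nn.Linear(128, 1)      # regression head
        self.emotion_head = nn.Linear(128, 4)  # classification head

    def forward(self, x):
        shared = self.trunk(x)
        return self.age_head(shared), self.emotion_head(shared)

age_pred, emotion_logits = MultiTaskNet()(torch.randn(8, 40))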

Pretraining and finetuning

means that you pretrain an artificial neural net with some large datasets that are related to your task, or at least have the same modality. You then remove the last layers of your net and add at least an output layer for your task. An example: wav2vec 2.0 is a model trained on many hundreds of hours of speech data with ASR (automatic speech recognition) as the main target, but its embeddings can also be used to classify emotional expression in speech.
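
As a sketch of how such embeddings can be extracted (assuming the Hugging Face transformers library and the facebook/wav2vec2-base checkpoint; the downstream classifier is not shown):

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000).numpy()  # one second of (fake) audio at 16 kHz
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, frames, 768)
embedding = hidden.mean(dim=1)  # pooled utterance embedding, e.g. for an emotion classifier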

Overfitting

means that the machine learner performs well on the training data but not on any other data. This usually happens when the model has enough complexity to distinguish all training samples and is trained for enough epochs (one epoch is one run through the training data). Measures against this are subsumed under the label regularization.

Loss function

is the function that artificial neural nets use to track progress during training, i.e. the function that compares the predicted outcome with the desired one. Finding a good loss function for your task is crucial.
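
Two common examples, sketched with PyTorch: mean squared error for regression tasks and cross entropy for classification tasks (the numbers are arbitrary):

import torch
import torch.nn as nn

predicted = torch.tensor([2.5, 0.0, 2.0])
desired   = torch.tensor([3.0, -0.5, 2.0])
mse = nn.MSELoss()(predicted, desired)     # average squared difference

logits = torch.tensor([[2.0, 0.5, 0.1]])   # raw scores for three classes
target = torch.tensor([0])                 # the desired class index
ce = nn.CrossEntropyLoss()(logits, target)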

Embeddings

are learned representations of data, usually taken from the penultimate layer of a pretrained artificial neural net.

Latent space

means the property of deep artificial neural nets to represent specific features of the data within the higher layers, for example speaker characteristics or expressed emotion in a net trained for speech synthesis. This is often used to influence the output in a desired way, for example simulating a specific speaking style.

Zero shot learning

means being able to predict classes that the machine learner has never seen at training time, by using some auxiliary information, for example textual descriptions of the classes.

Adversarial learning

generally means the attempt to corrupt a model after it has been deployed, in order to achieve some behavior that is unexpected by the developers and normal users. Examples would be tricking a model into false classifications by disguising the input in some form, reverse-engineering the model by observing its input-output behavior, or poisoning the training data to harm the model.

Active learning

means that the machine learner itself actively asks a so-called teacher (or oracle), often a human labeler, to annotate the samples it is most unsure about.
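
A sketch of one common strategy, uncertainty sampling (the data and the ask_oracle step are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

X_labeled = np.random.rand(50, 20)
y_labeled = np.random.randint(0, 2, 50)
X_pool = np.random.rand(5000, 20)  # large unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)
probs = model.predict_proba(X_pool)
uncertainty = 1 - probs.max(axis=1)        # low maximum probability = unsure
query_idx = np.argsort(uncertainty)[-10:]  # the 10 most uncertain samples
# y_new = ask_oracle(X_pool[query_idx])    # hypothetical human annotation step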

Bias vs. variance

means the trade-off between generalization (high bias, underfitting) and specialization (high variance, overfitting). You can either
a) use simple models, e.g. linear regression classifiers, that will treat every input with a similarly strong bias (wrong decisions), irrespective of the training set, or
b) use very complex models (e.g. a neural net with many layers) that will be more exact but very specific to your training data, as sketched below.
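
A small sketch of the two extremes with scikit-learn (synthetic data; a straight line underfits the curve, a degree-15 polynomial chases the noise):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

simple = make_pipeline(PolynomialFeatures(1), LinearRegression()).fit(X, y)     # high bias
complex_ = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)  # high variance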

Sources

I'd like to acknowledge some of my sources, to which I am much indebted: