
Get your speech recognized with Whisper

OpenAI has published new speech recognition models that are very easy to use and work in many languages. They were trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

In my case, all I had to do to recognize a German test file was:

# create a virtual environment
virtualenv venv
# activate it
. venv/bin/activate
# install whisper
pip install git+https://github.com/openai/whisper.git
# run the test
whisper test.wav --language German

And my file got recognized correctly, though it took a very long time. Note that the relative speed of about 32x that the Whisper README lists for the tiny model is measured against the large model, not against the duration of the speech file.

How to use selected features from Praat with Nkululeko

If you want to use acoustic parameters extracted by the wonderful Praat software with nkululeko, you state

[FEATS]
type=['praat']

in the feature section of your config file.
If you would like to use only some of the features that are extracted by David R. Feinberg's Praat scripts, you can look at the output and select some of them in the FEATS section, e.g.

type = ['praat']
praat.features = ['speechrate(nsyll / dur)']

You can do the same with opensmile features:

type = ['os']
os.features = ['F0semitoneFrom27.5Hz_sma3nz_amean']

or even combine them:

type = ['praat', 'os']
praat.features = ['speechrate(nsyll / dur)']
os.features = ['F0semitoneFrom27.5Hz_sma3nz_amean']

This is actually the same as

type = ['praat', 'os']
features = ['speechrate(nsyll / dur)', 'F0semitoneFrom27.5Hz_sma3nz_amean']

If you want to combine all of the opensmile eGeMAPS features with selected Praat features, you would use:

type = ['praat', 'os']
praat.features = ['speechrate(nsyll / dur)']

It is interesting to see how many emotions of the Berlin EmoDB still get recognized with only mean F0 and jitter as features:

[Image: recognition results on Berlin EmoDB with only mean F0 and jitter as features]

What kind of features are there, you might ask yourself?
Here's a list:
'duration', 'meanF0Hz', 'stdevF0Hz', 'HNR', 'localJitter',
'localabsoluteJitter', 'rapJitter', 'ppq5Jitter', 'ddpJitter',
'localShimmer', 'localdbShimmer', 'apq3Shimmer', 'apq5Shimmer',
'apq11Shimmer', 'ddaShimmer', 'f1_mean', 'f2_mean', 'f3_mean',
'f4_mean', 'f1_median', 'f2_median', 'f3_median', 'f4_median',
'JitterPCA', 'ShimmerPCA', 'pF', 'fdisp', 'avgFormant', 'mff',
'fitch_vtl', 'delta_f', 'vtl_delta_f'

Kinds of machine learning

This post is an attempt to sort out some terms that are used around the topic of machine learning.

AI

meaning "artificial intelligence", is a term that the computer scientist John McCarthy used at the Dartmouth Conference in 1956. It's really just a term to make the field sound more interesting. To this day, all so-called AI systems are simply based on statistical pattern recognition, and I do not know of a good model of human intelligence, or even a definition of it.

Soft/weak vs. strong AI

These terms are often used without a clear definition, and they come from different traditions: 1) a philosophical one, meaning the difference between replicating the system vs. the signal, as in the thought experiment of the Chinese room, which asks whether someone who does not know Chinese but can answer Chinese questions by looking up the answers in a rule book is intelligent; 2) the difference between symbolic AI, which works on models of intelligence built from expert knowledge, and stochastic AI, which uses data to detect underlying problem-solving strategies; and 3) what is usually meant in the current discussion: the distinction between a general AI that learns underlying principles to solve a number of problems, some of them not even known yet, and a specialized AI that is focused on one problem, e.g. playing chess or driving a car.

Deep learning

is a fuzzy expression connected with artificial neural nets. What is mostly meant is that the number of hidden layers (all layers apart from the input and output layers) is rather large, but it remains unclear how many layers are needed. The more layers, the harder it is to handle the vanishing gradient problem, i.e. that the early layers are no longer updated during training because the gradients become too small. Another interpretation (and one that makes more sense in my opinion) is that deep learning refers to the rising level of abstraction of the hidden layers from raw input to abstract labels (e.g. from picture pixels to animal names), especially with CNNs. For example, the early layers mainly represent edges and contours, while the later layers represent complex objects like beaks or eyes.

Classification vs. Regression

means the difference between predicting one class/category out of a limited set of possibilities (classification) and predicting a real value (regression). Regression problems can be converted to classification by binning, and classification problems to regression by interpolation (if the classes can be ordered by some criterion).
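The conversion in both directions can be sketched in a few lines. This is only an illustration under assumptions: the age estimation task, the bin edges, the class names and the class centers are all made up for the example.

```python
from bisect import bisect_right

# Hypothetical bin edges (each edge starts a new class) and class names.
AGE_EDGES = [20, 40, 60]
AGE_CLASSES = ["under_20", "20_to_39", "40_to_59", "60_plus"]
# Hypothetical representative value per class, used for the reverse direction.
CLASS_CENTERS = {"under_20": 15.0, "20_to_39": 30.0,
                 "40_to_59": 50.0, "60_plus": 70.0}


def age_to_class(age):
    """Regression target -> class label by binning."""
    return AGE_CLASSES[bisect_right(AGE_EDGES, age)]


def probabilities_to_age(class_probs):
    """Class probabilities -> real value by interpolating between class centers."""
    return sum(CLASS_CENTERS[name] * prob for name, prob in class_probs.items())
```

Note that the reverse direction only makes sense because the age classes have a natural order.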

Hyperparameter learning/tuning

An artificial neural net has two kinds of parameters: the weights and biases that are learned during training, and the so-called hyperparameters or meta-parameters that do not change during a training process, for example the net architecture (number of layers / neurons per layer), the learning rate or other algorithmic constants. As they influence the performance, they need to be learned as well, and that is what the development (aka evaluation) split is for: a part of the data that is used neither for training nor for the final test, but to evaluate the current hyperparameters. The easiest approach is a so-called grid search, i.e. trying all combinations. But because the number of combinations grows exponentially with the number of tuned hyperparameters, a random search or a learning algorithm is much more sensible.
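To make the difference between grid search and random search concrete, here is a minimal sketch. The search space and the dev_score function are hypothetical: a real dev_score would train a model on the training split and evaluate it on the development split, whereas this toy version only makes the example runnable.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "num_layers": [1, 2, 3],
    "hidden_units": [32, 64, 128],
}


def dev_score(params):
    """Toy stand-in for 'train, then evaluate on the development split'.

    It peaks at learning_rate=0.01, num_layers=2, hidden_units=64.
    """
    return -(abs(params["learning_rate"] - 0.01)
             + abs(params["num_layers"] - 2)
             + abs(params["hidden_units"] - 64) / 64)


def grid_search(space):
    """Try every combination; the count is the product of all list lengths."""
    names = list(space)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*(space[n] for n in names))]
    return max(candidates, key=dev_score), len(candidates)


def random_search(space, num_trials, seed=0):
    """Evaluate only a fixed budget of randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in space.items()}
                  for _ in range(num_trials)]
    return max(candidates, key=dev_score), len(candidates)
```

With three hyperparameters of three values each, the grid already needs 27 training runs, while the random search budget stays whatever you set it to.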

Supervised vs. unsupervised

means the distinction whether your training data is annotated (or labeled) with respect to your task. An example: If you want to build a machine learner for human age estimation based on speech, you might give an algorithm a lot of examples of human speech annotated with the age of the person. This would be your training data and the approach would be supervised (by the age annotations). With unsupervised learning, you would simply give an algorithm a lot of human speech data and ask it to cluster the data based on differences, hoping that the resulting clusters coincide with age.

Semi-supervised learning

means to use so-called soft labels for training, i.e. annotations that were generated by a machine learning predictor. If you have vast amounts of data but only a part of it is annotated, you might train a machine learner in a supervised way on the annotated data and then use the resulting model to predict labels for the rest. A variant would be to use the seed model to search for interesting data to be annotated, for example rare events in your data.
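The procedure can be sketched with a deliberately tiny classifier. Everything here is an assumption made for illustration: the data are single numbers, and the "model" is a nearest-centroid classifier rather than a neural net.

```python
def fit_centroids(samples):
    """Train the toy model: one mean value (centroid) per label.

    samples is a list of (value, label) pairs.
    """
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Predict the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def train_semi_supervised(labeled, unlabeled):
    """Train a seed model, label the rest with it, then retrain on everything."""
    seed_model = fit_centroids(labeled)
    machine_labeled = [(value, predict(seed_model, value)) for value in unlabeled]
    return fit_centroids(labeled + machine_labeled)
```

In practice one would usually keep only the machine-generated labels the seed model is confident about; that filtering step is left out here.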

Self-supervised learning

means techniques to prepare an artificial neural net by learning something from the data without predicting a concrete (supervised) feature/label/annotation. This can be done, for example, by masking parts of the data and training the net to predict the masked parts, as is commonly done when training transformer models. A different approach would be to use a triplet loss by training the net to distinguish nearby from far-away data. Once pretrained, a self-supervised net can be used for various downstream tasks, for example classification or regression of labeled data.

Reinforcement learning

is a fundamentally different kind of machine learning, usually described with the metaphor of "learning like a child". Although it mostly is, it does not necessarily have to be implemented with artificial neural nets. The main idea is that an actor receives sensations from an environment, which are interpreted and lead to new actions based on an evaluation criterion. As opposed to the loss functions of neural learning, the evaluation criterion is rather abstract, like "reach the wall" (for a robot that should learn how to walk) or "win the game" (for a chess player). It is rather tricky to translate such evaluation criteria into concrete loss functions (needed to train a machine learner), and reinforcement learning requires vast amounts of data that are usually generated by simulations. That is the reason why, although very charming as an idea, reinforcement learning has been successful mainly in gaming applications.

Representation learning

is learning to distinguish the essence of the data at hand from noise factors (that might come from the recording of the data). For example, with speech it is mostly not interesting which microphones recorded the speakers or what the room acoustics were like. All this is present in the data, but it is usually not important for the task at hand and is, in fact, one of the reasons for overfitting (learning the training data but not the task) and lack of generalization (generalization being the ability to recognize out-of-domain data, i.e. data from different sources). As modeling data in machine learning is always a dimension reduction, representation learning searches for the dimensions that best represent the interesting aspects of the data. Self-supervised learning is a kind of representation learning.

Transfer learning

means to transfer knowledge from one domain to another. There are many ways to do this. For example, you may pretrain your model with data from one domain and then finetune it with the data that represents your application. Self-supervised learning is also a kind of transfer learning. Another approach is multi-task learning, where one large artificial neural net is trained for several tasks in parallel, with a so-called multi-head architecture (meaning the last layers are separate for each task).
The main idea is that you can use large quantities of data that are related to your task.

Pretraining and finetuning

means that you pretrain an artificial neural net with some large data sets that are related to your task, or at least have the same modality. You then remove the last layers of your net and add at least an output layer for your task. An example: wav2vec 2.0 is a model trained on many hundreds of hours of speech data with ASR (automatic speech recognition) as a main target, but its embeddings can be used to classify emotional expression in speech.

One/few shot learning

means learning classes that have very few examples in the training data, by deriving information from related classes with more samples.

Zero shot learning

means to be able to predict classes that the machine learner has never seen at training time by using some auxiliary information, for example textual.

Adversarial learning

means generally the attempt to corrupt a model, typically after it has been deployed, in order to achieve some behavior that is unexpected by the developers and normal users. Examples would be to trick models into false classification by disguising the input in some form, reverse-engineering the model by learning from its input and output behavior, or influencing the training data to harm the model.

Active learning

means that the machine learner itself actively asks a so-called teacher (or oracle), often a human labeler, to annotate samples it is unsure of.

Curriculum learning

The technique of curriculum learning is again inspired by human learning (like reinforcement learning). It copies the strategy of getting better at a problem or ability by first looking at clear and easily separable samples and then progressively at more difficult ones. This might prevent overfitting and tends to lead to models with a much higher initial performance.

Contrastive learning

Contrastive learning is a kind of unsupervised learning that looks at different data items from a set and learns from the differences between them by contrasting similar with dissimilar items. This can be done, for example, with the triplet loss function, where a data sample (the anchor) is compared to one that is considered nearby (e.g. coming from the same context) and a third one that is considered far away (e.g. from a different context). The model is then trained without labels (apart from the context, which can usually be derived without labels) to distinguish between the positive and negative samples.
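As a minimal sketch, the triplet loss in its common margin-based form can be written as follows. The Euclidean distance and the margin of 1.0 are assumptions chosen for the example; other distances and margins are used as well.

```python
import math


def euclidean_distance(a, b):
    """Distance between two embedding vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative sample is at least `margin`
    farther away from the anchor than the positive sample.
    """
    return max(0.0, euclidean_distance(anchor, positive)
               - euclidean_distance(anchor, negative) + margin)
```

During training, the loss would be computed on the embeddings that the net produces for the three samples, so that minimizing it pulls the nearby sample closer and pushes the far-away one off.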

Federated / collaborative learning

means to distribute model training across a multitude of devices and/or servers, mainly to preserve privacy by not sharing the data but processing it on device and sending model updates.

Ensemble learning / Meta learning

means to use several machine learners and fuse their decisions later, either by rule or by a learned fusion. Some boosting techniques already have this idea built into the algorithm.
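Two simple fusion schemes can be sketched like this. The weighted variant is only a stand-in for a learned fusion: it assumes that one weight per learner is already available, for example its accuracy on the development split.

```python
from collections import Counter


def majority_vote(predictions):
    """Rule-based fusion: return the label predicted by most learners."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Fusion with one (hypothetical) weight per learner."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)
```

For example, three learners predicting "happy", "sad" and "happy" yield "happy" by majority vote, but "sad" if the second learner is trusted much more than the other two.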

Continuous / life long learning

means ANN architectures that prevent the overwriting of weights (aka "forgetting") when new training data comes in. This is important when one task is learned continuously on data coming from different domains, e.g. a voice diary that shall learn emotion recognition through the day.

Disentangled representation learning

is an unsupervised method based on the idea of learning different aspects of the data with respect to their level of abstraction (instead of simply representing each data item as a point in some space), by adding to the "raw" features some that are interpretable and independent of each other, e.g. speech rate and mean pitch. This enhances interpretability/explainability and robustness.

Foundation model

A model that is trained, usually unsupervised, on very large quantities of data. The penultimate layer can then be used for so-called down-stream tasks, for example as automatically learned features.

Terminology

Loss function

is the function that artificial neural nets use to track progress, i.e. the function that compares the predicted outcome with the desired one. Finding a good loss function is crucial for your task.
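Two widely used loss functions, sketched in plain Python: the mean squared error for regression and the cross entropy for classification. These are just common examples, not the only choices.

```python
import math


def mean_squared_error(predicted, desired):
    """Average squared difference between predicted and desired values."""
    return sum((p - d) ** 2 for p, d in zip(predicted, desired)) / len(desired)


def cross_entropy(predicted_probs, true_class_index):
    """Negative log of the probability that the model assigned to the true class."""
    return -math.log(predicted_probs[true_class_index])
```

Both are zero for a perfect prediction and grow the further the prediction is from the desired outcome.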

Backpropagation

Fundamental way to train neural networks by evaluating the error with the loss function and then propagating it backwards towards the input layer, by taking the derivative.
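Here is a minimal sketch of the idea on the smallest possible "network": one tanh hidden unit followed by one linear output unit, with a squared error loss. The architecture, the initial weights and the learning rate are assumptions chosen to keep the example short; real frameworks compute these derivatives automatically.

```python
import math


def forward(x, w1, w2):
    """Forward pass: input -> tanh hidden unit -> linear output."""
    hidden = math.tanh(w1 * x)
    output = w2 * hidden
    return hidden, output


def loss(output, target):
    """Squared error between output and target."""
    return (output - target) ** 2


def backward(x, target, w1, w2):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    hidden, output = forward(x, w1, w2)
    d_output = 2.0 * (output - target)                   # dLoss/dOutput
    d_w2 = d_output * hidden                             # dLoss/dw2
    d_hidden = d_output * w2                             # dLoss/dHidden
    d_pre_activation = d_hidden * (1.0 - hidden ** 2)    # through tanh
    d_w1 = d_pre_activation * x                          # dLoss/dw1
    return d_w1, d_w2


def train(x, target, w1=0.5, w2=0.5, learning_rate=0.1, steps=200):
    """Plain gradient descent on a single training sample."""
    for _ in range(steps):
        d_w1, d_w2 = backward(x, target, w1, w2)
        w1 -= learning_rate * d_w1
        w2 -= learning_rate * d_w2
    return w1, w2
```

The error is computed once at the output and then reused for every earlier weight, which is what makes backpropagation efficient.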

Batch size

number of samples in one batch of the training data, which are used together to compute the error (see loss function) and to do the backpropagation step.

Embeddings

are learned representations of data, usually the penultimate layer of a pretrained artificial neural net.

Latent space

means the property of deep artificial neural nets to represent specific features of the data within the higher layers, for example speaker characteristics or expressed emotion in a net trained for speech synthesis. This is often used to influence the output in a desired way, for example simulating a specific speaking style.

Freezing

layers in an ANN means to not update the weights, as they might contain knowledge that should not be forgotten (from a pretrained net) or to make the training faster.

Drop out

is the technique of temporarily deactivating a number of randomly selected neurons in a hidden layer during training to prevent overfitting.
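A minimal sketch of how this can look for the activations of one layer. It uses the common "inverted dropout" variant, which is an assumption on my side: the surviving activations are scaled up during training so that nothing has to change at inference time.

```python
import random


def dropout(activations, drop_probability, rng, training=True):
    """Zero each activation with drop_probability and rescale the survivors.

    rng is a random.Random instance; at inference time (training=False)
    the activations are passed through unchanged.
    """
    if not training or drop_probability == 0.0:
        return list(activations)
    keep_probability = 1.0 - drop_probability
    return [a / keep_probability if rng.random() < keep_probability else 0.0
            for a in activations]
```

A call like dropout(layer_output, 0.5, random.Random(0)) would drop roughly half of the values in each training step, with a different random selection every time.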

Patience

Number of epochs with no improvement after which training will be stopped.
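The stopping rule can be sketched as follows. To keep the example self-contained, it assumes that the loss on the development split is already given as one number per epoch; in a real training loop it would be computed after each epoch.

```python
def train_with_patience(dev_losses, patience):
    """Return (epoch at which training stopped, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, dev_loss in enumerate(dev_losses):
        if dev_loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch = dev_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(dev_losses) - 1, best_epoch
```

With the losses 0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5, a patience of 3 stops training at epoch 5 and misses the later improvement, whereas a patience of 5 runs to the end.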

Overfitting

means that the machine learner performs well on the training data but not on any other data. This is usually the case when the model has enough complexity to distinguish all training data and is trained for enough epochs (one epoch is one pass through the training data). Measures against this are subsumed under the label regularization.

Vanishing / exploding gradient

means that the gradients used to update the weights of the neurons become too small or too large for the training to be stable. This happens especially with very deep networks (many layers).
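A small numeric sketch of why depth matters. It assumes a strongly simplified net: a chain of single units that all share the same weight, once with sigmoid activations and once without any activation. By the chain rule, the gradient that reaches the first layer is a product with one factor per layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_chain_gradient(weight, num_layers, x=0.5):
    """Derivative of the last activation with respect to the input x."""
    activation = x
    gradient = 1.0
    for _ in range(num_layers):
        activation = sigmoid(weight * activation)
        # The derivative of sigmoid is a * (1 - a), which is at most 0.25.
        gradient *= weight * activation * (1.0 - activation)
    return gradient


def linear_chain_gradient(weight, num_layers):
    """Same chain without activation: the gradient is weight ** num_layers."""
    return weight ** num_layers
```

With a weight of 1.0, every sigmoid layer shrinks the gradient by a factor of at most 0.25, so after 20 layers almost nothing is left. In the linear chain, a weight of 1.5 lets the gradient grow exponentially instead.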

Bias vs. variance

means the trade-off between generalization (high bias, underfitting) and specialization (high variance, overfitting). You can either have
a) simple models, e.g. linear classifiers, that will treat every input with a similarly strong bias (leading to wrong decisions), irrespective of the training set, or
b) very complex models (e.g. a neural net with many layers) that will be more exact but very specific to your training data.
Here's a nice visualization of bias vs. variance.

ANN architectures

Perceptron

Perceptron is the name that Frank Rosenblatt gave in 1958 to the concept of modeling learning as a linear equation filtered by a non-linear function, inspired by the human neuron cell that fires only if a certain electric potential has been reached. Minsky and Papert analyzed its limitations in their 1969 book "Perceptrons".
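As a minimal sketch, a single perceptron with a step function as non-linearity can learn the logical AND function with the classic perceptron learning rule. The learning rate, the number of epochs and the zero initialization are assumptions for the example.

```python
def step(z):
    """The neuron 'fires' (outputs 1) only above the threshold."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    """Linear equation filtered by the non-linear step function."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Perceptron rule: shift the weights by the error times the input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table of the logical AND function as (inputs, target) pairs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

A single perceptron can only separate classes with a straight line (or hyperplane), so the same code would never converge on XOR.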

MLP/FFN - Multilayer Perceptron / Feedforward Neural Network

Many perceptrons organized in layers of neurons, transforming information from input to output (in one direction). During the training stage, the weights (of the neurons/perceptrons) are updated by the so-called backpropagation algorithm.
These are so-called vanilla networks because they are very simple (and were the first ones to be developed), but they are still often used as the last layers of a network to actually deliver a result. The main problems are the very large number of connections (and thus weights) and the vanishing gradient if they have many layers.

RNN - Recurrent Neural Network

is an ANN architecture where cells can have their own output as input, which means that the ANNs get a time dimension that does not exist with the older Feed Forward nets.

CNN - Convolutional Neural Network

are ANNs that reduce the number of connections between cells by introducing filters/patches that can be reused across the input field. This is inspired by techniques from image analysis. One of the big advantages (apart from reducing the number of weights) is that the layers are in part interpretable, as they become more and more high-level.

LSTM - Long Short-Term Memory Network

Is a kind of RNN with memory cells. A simpler form is known as GRU (gated recurrent unit).

ResNet - Residual Neural Network

Are very deep ANNs that avoid the vanishing gradient problem by introducing skip connections, which let the input of a block of layers bypass that block and be added to its output.

GANs - Generative Adversarial Networks

Are a combination of two networks: a generator that tries to replicate samples from a training set, and a discriminator that tries to distinguish the original from the generated samples. As both get better, the likeness of the fake samples to the originals improves.

VAE - Variational Autoencoders

Consist of two networks, an encoder and a decoder; the task is to restore a sample that was reduced to a lower dimension by the encoder. With variational AEs, the decoder input (i.e. the low-dimensional encoder output) can be interpolated, and thus new samples can be created as mixtures of the original ones. Another use case is representation learning, as the dimensionality reduction step forces the net to learn which information in the signal is relevant for the nature of the samples.

Sources

I'd like to reveal some of my sources, much indebted:

Bio

Dr. Felix Burkhardt does teaching, consulting, research and development on speech communication, human-machine dialog systems, text-to-speech synthesis, speaker classification, ontology based natural language modeling and emotional human-machine interfaces.

Originally an expert in speech synthesis at the Technical University of Berlin, he wrote his Ph.D. thesis on the simulation of emotional speech by machines, recorded the Berlin acted emotions database "EmoDB", and maintains several open source projects, including the emotional speech synthesizer "Emofilt", the speech labeling and annotation tool "Speechalyzer" and the machine learning framework "Nkululeko". Since 2018 he has been the research director at audEERING, after having worked for Deutsche Telekom AG for 18 years. From 2020 to 2022 he worked as a full professor at the Institute of Communication Science of the Technical University of Berlin.

He was a member of the European Network of Excellence HUMAINE on emotion-oriented computing, is the editor of the W3C Emotion Markup Language specification, and serves on the program committees of numerous conferences, including ACII, AVEC, EmoSPACE, FLAIR, IASTED CI, ICASSP, ICMI, ICPhS, Interspeech, IVA, IWSDS, LREC, Paraling, Prosico, SLSP and WS3P, for the journals Specom, CSL, JASA, EURASIP, SIGPRO, IEEE-TAFFC, IEEE-TMM, IEEE-TASL, IEEE-TIP, IEEE-ISSI, IJSE, ETRI, Journal of Phonetics, Neural Processing Letters and UMUAI, and for the publisher Wiley.