
Predict emotional states with the audEERING model

audEERING recently published an emotion prediction model based on a finetuned Wav2vec2 transformer model.

Here I'd like to show you how you can use this model to predict emotional states for your own audio samples (this is also explained in the model's GitHub repository).

As usual, you should start by dedicating a folder on your hard disk to this and creating a virtual environment:

virtualenv -p=3 venv

which means we want Python version 3 (and not 2).
Don't forget to activate it!

Then you need to install the packages that are used:

protobuf == 3.20

It is easiest to copy this list into a file called requirements.txt and then run

pip install -r requirements.txt
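Before going on, it can help to check that the packages are really importable in the activated environment. This is a minimal sketch that only uses the standard library; the helper name missing_packages is made up for this example, and the package list is taken from the imports of the script in this post.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# the packages imported by the script in this post
required = ['audeer', 'audonnx', 'numpy', 'audiofile', 'audinterface']
missing = missing_packages(required)
if missing:
    print('please install:', ', '.join(missing))
else:
    print('all packages found')
```
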

Then start writing a Python script that imports the packages:

import audeer
import audonnx
import numpy as np
import audiofile
import audinterface

Then load the model:

# download and load the model
url = ''
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)

sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)

Then load a test sentence (in 16 kHz, 16 bit wav format):

# read in a wave file for testing
signal, sampling_rate = audiofile.read('test.wav')

and print out the results:

# print the results in the order arousal, dominance, valence.
print(model(signal, sampling_rate)['logits'].flatten())
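Since the output is just three numbers, it is easy to mix up the order. As a small sketch (the helper name_scores and the example numbers are made up, only the order arousal, dominance, valence comes from the comment above), you could attach the names to the values like this:

```python
def name_scores(values, names=('arousal', 'dominance', 'valence')):
    """Pair each predicted value with its dimension name."""
    values = [float(v) for v in values]
    if len(values) != len(names):
        raise ValueError(f'expected {len(names)} values, got {len(values)}')
    return dict(zip(names, values))


# made-up numbers standing in for model(signal, sampling_rate)['logits'].flatten()
example = [0.55, 0.61, 0.43]
print(name_scores(example))
```
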

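You can also use audinterface's magic and process a whole list of files like this:

# define the interface
interface = audinterface.Feature(
    model.labels('logits'),
    process_func=model,
    process_func_args={'outputs': 'logits'},
    sampling_rate=sampling_rate,
    resample=True,
)
# create a list of audio files
files = ['test.wav']
# and process it
print(interface.process_files(files).round(2))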

This should result in a table with one row per file and one column each for arousal, dominance and valence.

Also check out this great jupyter notebook from audEERING

Get your speech recognized with Whisper

OpenAI published new speech recognition models that are very easy to use and work in many languages. They were trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

In my case, all I had to do to recognize some German test speech was:

# create a virtual environment
virtualenv venv
# activate it
. venv/bin/activate
# install whisper
pip install git+
# run the test
whisper test.wav --language German

My file was recognized correctly, though it took a very long time. Note that the speed of x32 announced for the tiny model is a relative figure compared to the large model, not 32 times the duration of the speech file.
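One way to put a number on how long the recognition takes is the real-time factor, i.e. the processing time divided by the duration of the audio. The following sketch uses made-up numbers, and the helper names real_time_factor and timed are my own; timed can wrap any function call whose run time you want to measure.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError('audio duration must be positive')
    return processing_seconds / audio_seconds


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# made-up example: a 10 second file that took 320 seconds to process
print(real_time_factor(320.0, 10.0))
```
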

Nkululeko: How to evaluate a test set with a given best model

Since version 0.27.0, Nkululeko has the concept of a test set, in addition to the train and dev sets.

Let's recap the concept of train/dev/test splits:

  • train is used to train a supervised model
  • dev is a set to evaluate this model, i.e. know when it is a good model (that doesn't overfit)
  • test is a set to be used ONLY once: for the real use of the model. If you used the test set as a dev set, you couldn't be sure that you're not overfitting again (because you would have used it to adjust the meta parameters of your model).
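To make the three sets concrete, here is a minimal sketch of a random split using only the standard library. The function split_samples, the fractions and the file names are made up for illustration; this is not how Nkululeko splits data.

```python
import random


def split_samples(samples, dev_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle the samples reproducibly and cut them into train, dev and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# hypothetical file names, just for illustration
files = [f'sample_{i:02d}.wav' for i in range(10)]
train, dev, test = split_samples(files)
print(len(train), len(dev), len(test))
```

The test part is then put aside and only touched once, after all tuning on the dev part is finished.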

So, in order to evaluate a third dataset (besides train and dev), you set a label_data entry in the [DATA] section of the configuration like so:

label_data = emovo
label_result = my_label_result.csv

and then run the experiment.
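If you want to double check what ends up in the [DATA] section, you can read the entries back with Python's configparser. This is only a sketch under the assumption that the configuration is a plain INI file; it does not show how Nkululeko itself reads it.

```python
import configparser

# fragment of a config file with the two entries from this post
config_text = """
[DATA]
label_data = emovo
label_result = my_label_result.csv
"""

config = configparser.ConfigParser()
config.read_string(config_text)

label_data = config['DATA']['label_data']
label_result = config['DATA']['label_result']
print(f'data to process: {label_data}, results go to: {label_result}')
```
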

How to use Latex for your project documentation

Using a documentation system that separates content and presentation has many advantages, the biggest one probably being flexibility.
I vote for LaTeX, and since there is now a company that offers a free LaTeX environment, you don't have to set it up yourself (you still can, but it might be tedious).

I've set up a sample project that you should be able to copy and use as a start here:

Overleaf sample project

How to speech synthesize in German with ESPnet

Here is how to do text-to-speech (TTS) synthesis with a German single female speaker Tacotron2 model and ESPnet2.

You need the python packages

pip install torch espnet_model_zoo phonemizer

Then you can run

import soundfile
from espnet2.bin.tts_inference import Text2Speech

model = ''
text2speech = Text2Speech.from_pretrained(model)

speech = text2speech("Wow, das war ja einfach!")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")

How to segment and label a speech database

Segmenting means in this case: splitting a longer audio file based on speech pauses.

This post shows you how to record, segment and then label a speech recording using the Ina speech segmenter and Labeltool.

Record audio

Firstly you need a recording. You might do that with your mobile phone or a microphone connected to your computer using, for example, Audacity.

I'd recommend recording / changing the sample rate to 16 kHz, as this is sufficient for speech recordings.

Let's say you stored your recording in the file longer_test.wav inside a directory named utterances.
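To check that a recording really has the intended sample rate, you can look at the wav header with Python's built-in wave module. In this sketch the helper name wav_info is made up, and a short demo file is written first so that the example runs on its own; for your recording you would pass the path to longer_test.wav instead.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample rate in Hz, sample width in bits, channels) of a PCM wav file."""
    with wave.open(path, 'rb') as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


# write one second of silence as a 16 kHz, 16 bit, mono file to have something to check
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = 16 bit
    wav.setframerate(16000)
    wav.writeframes(b'\x00\x00' * 16000)

print(wav_info(demo_path))
```
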

Segment the recording

We start doing the segmentation in a python script.
You need some packages installed: pandas, inaSpeechSegmenter and audformat

# we start with the imports
import pandas as pd
from inaSpeechSegmenter import Segmenter
from inaSpeechSegmenter.export_funcs import seg2csv, seg2textgrid
from audformat.utils import to_filewise_index
from audformat import segmented_index

# we then use variables for our recording:
root = './utterances/'
media = 'longer_test.wav'

# the INA speech segmenter is very easy to use:
seg = Segmenter()
segmentation = seg(root + media)

# if curious, try:
print(segmentation)

# then collect the segments that were recognized as human, either female or male:
files, starts, ends = [], [], []
for entry in segmentation:
    kind = entry[0]
    start = entry[1]
    end = entry[2]
    if kind == 'female' or kind == 'male':
        print(f'{media}, {start}, {end}')
        # remember the segment; the segmenter reports times in seconds
        files.append(media)
        starts.append(pd.to_timedelta(start, unit='s'))
        ends.append(pd.to_timedelta(end, unit='s'))
seg_index = segmented_index(files, starts, ends)

# this index can now be used by audformat to actually cut the audio file into segments
df = pd.DataFrame(index = seg_index)
file_list = to_filewise_index(df , root, 'audio_out', progress_bar = True)

# the resulting list can be stored to disk:
file_list.to_csv('file_list.csv', header=False)
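To get a feeling for the selection step without running the segmenter, here is a toy version in plain Python. The segmentation values are made up and only illustrate the assumed layout of one (label, start, end) tuple per segment, with times in seconds; the helper name speech_segments is my own.

```python
def speech_segments(segmentation, speech_labels=('female', 'male')):
    """Keep only the (start, end) pairs of segments labeled as speech."""
    return [(start, end) for label, start, end in segmentation
            if label in speech_labels]


# made-up segmenter output, only to illustrate the assumed tuple layout
toy_segmentation = [
    ('noEnergy', 0.0, 0.8),
    ('female', 0.8, 3.1),
    ('noEnergy', 3.1, 3.9),
    ('male', 3.9, 6.4),
    ('music', 6.4, 9.0),
]
for start, end in speech_segments(toy_segmentation):
    print(f'{start:.1f}s - {end:.1f}s ({end - start:.1f}s)')
```
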

Label the recording

Labeling means adding metadata to the samples, for example emotional arousal.
There are hundreds of tools to do this; I use, of course, the one I programmed myself, Speechalyzer 😉
Here's a tutorial on how to set this up and how to adapt the tool.

If you are running Linux, you could then start the Speechalyzer with the file list you created like this:

java -jar ~/research/Speechalyzer/Speechalyzer.jar -cf ~/research/Speechalyzer/res/ -fl file_list.csv

and then simply start the Labeltool to label the files.

Speechalyzer can then export the labels to a file which can be used by Nkululeko as a labeled speech database in CSV format.

How to combine feature sets with Nkululeko

If you want to combine several acoustic parameter (feature) sets with nkululeko, you might state

type = ['mld', 'praat']
features = ['JitterPCA', 'meanF0Hz', 'hld_sylRate']

This would combine the

  • hld_sylRate feature from MLD
  • JitterPCA feature from Feinberg's Praat features and
  • meanF0Hz feature from Feinberg's Praat features

Of course you could omit the features entry and simply use all of them.
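Conceptually, combining feature sets means putting the columns of several feature tables side by side and then keeping only the requested ones. The following plain Python sketch shows that idea; it is not Nkululeko's implementation, the values are made up, and hld_other is a hypothetical feature name that only serves to show that unselected features are dropped.

```python
def combine_features(feature_sets, selected=None):
    """Merge several {feature name: values} tables and optionally keep a subset."""
    combined = {}
    for features in feature_sets.values():
        for name, values in features.items():
            if name in combined:
                raise ValueError(f'feature {name} appears in more than one set')
            combined[name] = values
    if selected is None:
        return combined
    missing = [name for name in selected if name not in combined]
    if missing:
        raise KeyError(f'unknown features: {missing}')
    return {name: combined[name] for name in selected}


# made-up values for two samples per feature
feature_sets = {
    'mld': {'hld_sylRate': [3.9, 4.4], 'hld_other': [0.1, 0.2]},
    'praat': {'JitterPCA': [0.7, -0.2], 'meanF0Hz': [182.0, 121.5]},
}
print(combine_features(feature_sets, ['JitterPCA', 'meanF0Hz', 'hld_sylRate']))
```
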

It's interesting to see how many emotions from Berlin Emodb can still be recognized with only these three parameters:

How to use selected features from Praat with Nkululeko

If you want to use acoustic parameters extracted by the wonderful Praat software with nkululeko, you state

type = praat

in the feature section of your config file.
If you'd like to use only some of the features that are extracted by David R. Feinberg's Praat scripts, you can look at the output and select some of them in the FEAT section, e.g.

type = praat
features = ['JitterPCA', 'meanF0Hz']

It is interesting to see how many emotions of Berlin EmoDB still get recognized with only mean F0 and jitter as features:


How to test a trained model on a new test set with Nkululeko

Sometimes you might want to test your already trained model(s) on a new dataset, e.g. because the training took a lot of resources.
This is possible if you stored your models during training, e.g. with these entries in the config file of the original experiment:

databases = ['emodb']
save = True

In a new config file for your experiment that uses a different test set, you set

databases = ['emodb', 'polish']
trains = ['emodb']
tests = ['polish']
strategy = cross_data....
only_test = True

In the example above, emodb has been used as the training database, and polish in a second experiment later as a test database.

How to compare several MLP layer layouts with each other

Some days ago I showed how you can run several experiments in one go.
Obviously this can be used to compare several ANN layer architectures as an alternative to the approach discussed in this (much earlier) post

There is an example configuration shipped with Nkululeko, and you can simply state your layer specifications per experiment like this:

classifiers = [
    {'--model': 'mlp',
    '--layers': '\"{\'l1\':16,\'l2\':4}\"'},
    {'--model': 'mlp',
    '--layers': '\"{\'l1\':64,\'l2\':16}\"'},
    {'--model': 'mlp',
    '--layers': '\"{\'l1\':128,\'l2\':32}\"',
    '--learning_rate': '.0001',
    '--drop': '.3',},
    {'--model': 'xgb'},
    {'--model': 'svm'},
]

i.e., in this example, three MLP classifiers are specified with these architectures:

  • (hidden) layer 1 with 16 neurons, and (hidden) layer 2 with 4 neurons
  • one layer with 64 and one with 16 neurons
  • and a third one with
    • one layer with 128 and a second one with 32 neurons,
    • learning rate of .0001 and
    • dropout probability of 30%

and, for comparison:

  • an XGB classifier
  • and an SVM classifier

Both only need to be trained for one epoch because there are no weights to be adapted.
The MLP classifiers are trained with the number of epochs specified in the skeleton config file.
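The escaped quotes in the layer specifications are a bit hard to read. As a sketch of what such a string contains (the helper layer_sizes is made up and not part of Nkululeko), you can strip the outer double quotes and let Python's ast module turn the rest into a dictionary:

```python
import ast


def layer_sizes(spec):
    """Turn a layer specification string into a list of neuron counts."""
    layers = ast.literal_eval(spec.strip('"'))
    # order by the number in the layer name, so that l1 comes before l2
    return [layers[name] for name in sorted(layers, key=lambda n: int(n[1:]))]


classifiers = [
    {'--model': 'mlp', '--layers': '\"{\'l1\':16,\'l2\':4}\"'},
    {'--model': 'mlp', '--layers': '\"{\'l1\':64,\'l2\':16}\"'},
    {'--model': 'xgb'},
]
for classifier in classifiers:
    if '--layers' in classifier:
        print(classifier['--model'], layer_sizes(classifier['--layers']))
    else:
        print(classifier['--model'], 'no layers')
```
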