Tag Archives: emodb

Setting up a base nkululeko experiment

This is one of a series of posts on how to use nkululeko. It deals with setting up the "hello world" of nkululeko: performing classification on the Berlin EmoDB emotional database.

Typically nkululeko experiments are defined by two files:

  • a python file that is called by the interpreter
  • an initialization file that is interpreted by the nkululeko framework

First we'll take a look at the python file:

# my_experiment.py
# Demonstration code to use the Nkululeko framework

import sys
sys.path.append("TO BE ADAPTED/nkululeko/src")
import configparser # to read the ini file
import experiment as exp # central nkululeko class
from util import Util # mainly for logging

def main(config_file):
    # load one configuration per experiment
    config = configparser.ConfigParser()
    config.read(config_file) # read in the ini file, the experiment is defined there
    util = Util() # init the logging and global stuff

    # create a new experiment
    expr = exp.Experiment(config)
    util.debug(f'running {expr.name}')

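    # load the data sets (specified in ini file)
    # (the expr.* method names below are those of nkululeko's Experiment class
    # at the time of writing; check them against your installed version)
    expr.load_datasets()

    # split into train and test sets
    expr.fill_train_and_tests()
    util.debug(f'train shape : {expr.df_train.shape}, test shape:{expr.df_test.shape}')

    # extract features
    expr.extract_feats()
    util.debug(f'train feats shape : {expr.feats_train.df.shape}, test feats shape:{expr.feats_test.df.shape}')

    # initialize a run manager and run the experiment
    expr.init_runmanager()
    expr.run()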

if __name__ == "__main__":
    main('PATH TO INI FILE/exp_emodb.ini') 
    # main(sys.argv[1]) # alternatively read it from command line

And this would be a minimal nkululeko configuration file (tested with version 0.8):

[EXP]
root = ./emodb/
name = exp_emodb
[DATA]
databases = ['emodb']
emodb = TO BE ADAPTED/emodb
emodb.split_strategy = speaker_split
emodb.testsplit = 40
target = emotion
labels = ['anger', 'boredom', 'disgust', 'fear', 'happiness', 'neutral', 'sadness']
[FEATS]
type = os
[MODEL]
type = svm

I hope the names of the entries are self-explanatory; if not, see the config file description.
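
As a rough illustration of how such a file is consumed, the following sketch parses an equivalent configuration with Python's configparser. The section names ([EXP], [DATA], [FEATS], [MODEL]) should be treated as an assumption here, and using ast.literal_eval to turn list-valued entries into Python lists is a choice made for this example, not a description of nkululeko's internals.

```python
import ast
import configparser

# Example configuration; the section names are an assumption for this sketch.
INI_TEXT = """
[EXP]
root = ./emodb/
name = exp_emodb
[DATA]
databases = ['emodb']
emodb.split_strategy = speaker_split
emodb.testsplit = 40
target = emotion
labels = ['anger', 'boredom', 'disgust', 'fear', 'happiness', 'neutral', 'sadness']
[FEATS]
type = os
[MODEL]
type = svm
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# All values arrive as strings; list-like entries need explicit parsing.
databases = ast.literal_eval(config["DATA"]["databases"])
labels = ast.literal_eval(config["DATA"]["labels"])
test_percent = config["DATA"].getint("emodb.testsplit")

print(config["EXP"]["name"], databases, len(labels), test_percent)
print(config["FEATS"]["type"], config["MODEL"]["type"])
```

Note that the two "type" entries can only coexist because they live in different sections.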

Get all information from emodb

When you load the Berlin EmoDB as has been done in numerous posts on this blog, you get by default only the file name, speaker id, text id and emotion.

But the audformat database contains more information, and this post shows you how to access it.

If not already somewhere on your computer, start by downloading the emodb:

import os

if not os.path.isdir('./emodb/'):
    !wget -c https://tubcloud.tu-berlin.de/s/LfkysdXJfiobiEG/download
    !mv download emodb_audformat.zip
    !unzip emodb_audformat.zip
    !rm emodb_audformat.zip

This code will then load the database, prepare a single dataframe with all information and store it to disk for later use:

import os
import audformat

# load the database to memory
root = './emodb/'
db = audformat.Database.load(root)
# map the file paths to the audio
db.map_files(lambda x: os.path.join(root, x))
# access speaker gender and age, and transcription, from the speaker dictionaries
df = db.tables['files'].get(map={'speaker': ['speaker', 'gender', 'age'], 'transcription': ['transcription']})
# copy the emotion label from the emotion dataframe to the files dataframe
df['emotion'] = db.tables['emotion'].df['emotion']
# add a column with the word count
df['wordcount'] = df['transcription'].apply(lambda row: len(row.split()))
# store to disk for later use (the file name is an assumption)
df.to_pickle('emodb.pkl')


Predict emodb emotions with a Multi Layer Perceptron ANN

This post shows you how to classify emotions with a Multi Layer Perceptron (MLP) artificial neural net based on the torch framework (a different very famous ANN framework would be Keras).

Here's a complete jupyter notebook for your convenience.

We start with some imports. You need to install these packages, e.g. with pip, before you run this code:

import audformat
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import os
import opensmile
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score

Then we need to download and prepare our sample dataset, the Berlin emodb:

# get and unpack the Berlin Emodb emotional database if not already there
if not os.path.isdir('./emodb/'):
    !wget -c https://tubcloud.tu-berlin.de/s/8Td8kf8NXpD9aKM/download
    !mv download emodb_audformat.zip
    !unzip emodb_audformat.zip
    !rm emodb_audformat.zip
# prepare the dataframe
db = audformat.Database.load('./emodb')
root = './emodb/'
db.map_files(lambda x: os.path.join(root, x))    
df_emotion = db.tables['emotion'].df
df = db.tables['files'].df
# copy the emotion label from the emotion dataframe to the files dataframe
df['emotion'] = df_emotion['emotion']

As neural nets can only deal with numbers, we need to encode the target emotion labels with numbers:

# Encode the emotion words as numbers and use this as target
target = 'enc_emo'
encoder = LabelEncoder()
# the encoder has to be fitted on the labels before it can transform them
df[target] = encoder.fit_transform(df['emotion'])
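
To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind a label encoder: the sorted unique labels are numbered from 0. The helper names fit_label_map and encode are hypothetical and only mimic the behaviour; in the notebook itself sklearn's LabelEncoder does this work.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def encode(labels, label_map):
    """Replace every label by its integer code."""
    return [label_map[label] for label in labels]


emotions = ["anger", "neutral", "happiness", "anger", "sadness"]
label_map = fit_label_map(emotions)
codes = encode(emotions, label_map)

# The inverse mapping turns predictions back into readable labels.
inverse_map = {index: label for label, index in label_map.items()}
decoded = [inverse_map[code] for code in codes]

print(label_map)
print(codes)
```

Keeping the inverse mapping around is what later allows the confusion matrix to be labelled with emotion names instead of numbers.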

The dataframe now has an additional enc_emo column that holds the numeric code of each emotion label.


To ensure that we learn about emotions and not speaker idiosyncrasies, we need speaker-disjoint training and development sets:

# define fixed speaker disjunct train and test sets
train_spkrs = df.speaker.unique()[5:]
test_spkrs = df.speaker.unique()[:5]
df_train = df[df.speaker.isin(train_spkrs)]
df_test = df[df.speaker.isin(test_spkrs)]

print(f'#train samples: {df_train.shape[0]}, #test samples: {df_test.shape[0]}')
#train samples: 292, #test samples: 243
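
A quick sanity check for such a split is to confirm that no speaker ends up on both sides. The sketch below does this on a small made-up list of (file, speaker) pairs; the helper split_by_speaker and the toy data are illustrative assumptions, not part of the notebook.

```python
def split_by_speaker(samples, test_speakers):
    """Split (file, speaker) pairs so that test speakers never occur in train."""
    test_speakers = set(test_speakers)
    train = [s for s in samples if s[1] not in test_speakers]
    test = [s for s in samples if s[1] in test_speakers]
    return train, test


# Toy data: file name and speaker id (made up for this example).
samples = [
    ("a01.wav", 3), ("a02.wav", 3), ("b01.wav", 8),
    ("c01.wav", 9), ("c02.wav", 9), ("d01.wav", 10),
]
train, test = split_by_speaker(samples, test_speakers=[3, 8])

train_ids = {speaker for _, speaker in train}
test_ids = {speaker for _, speaker in test}
overlap = train_ids & test_ids

print(f"#train samples: {len(train)}, #test samples: {len(test)}")
print(f"speakers in both sets: {sorted(overlap)}")
```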

Next, we need to extract some acoustic features:

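# extract (or get) GeMAPS features
if os.path.isfile('feats_train.pkl'):
    feats_train = pd.read_pickle('feats_train.pkl')
    feats_test = pd.read_pickle('feats_test.pkl')
else:
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.GeMAPSv01b,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    feats_train = smile.process_files(df_train.index)
    feats_test = smile.process_files(df_test.index)
    # cache the features so they need not be extracted again
    feats_train.to_pickle('feats_train.pkl')
    feats_test.to_pickle('feats_test.pkl')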

Because neural nets are sensitive to large numbers, we need to scale all features to a mean of 0 and a standard deviation of 1:

# Perform a standard scaling / z-transformation on the features (mean=0, std=1)
scaler = StandardScaler()
# estimate mean and std on the training set only
scaler.fit(feats_train)
feats_train_norm = pd.DataFrame(scaler.transform(feats_train))
feats_test_norm = pd.DataFrame(scaler.transform(feats_test))
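
The important detail of this step is that mean and standard deviation are estimated on the training set only and then reused for the test set. The following sketch shows this for a single feature with made-up values, using only the standard library; it uses the population standard deviation, which is what StandardScaler computes. The helper names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Estimate mean and (population) standard deviation."""
    return fmean(values), pstdev(values)


def apply_scaler(values, mean, std):
    """z-transform values with previously estimated statistics."""
    return [(v - mean) / std for v in values]


train_feature = [100.0, 120.0, 140.0, 160.0]  # made-up values
test_feature = [130.0, 190.0]

mean, std = fit_scaler(train_feature)  # statistics from the training set only
train_norm = apply_scaler(train_feature, mean, std)
test_norm = apply_scaler(test_feature, mean, std)

print(round(fmean(train_norm), 6), round(pstdev(train_norm), 6))
print([round(v, 3) for v in test_norm])
```

The scaled test values do not have to be centred around 0 themselves; they are merely expressed in units of the training distribution.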

Next we define two torch dataloaders, one for the training and one for the dev set:

def get_loader(df_x, df_y):
    data = []
    for i in range(len(df_x)):
        # pair each feature vector with its encoded label (positional access)
        data.append([df_x.values[i], df_y[target].iloc[i]])
    return torch.utils.data.DataLoader(data, shuffle=True, batch_size=8)
trainloader = get_loader(feats_train_norm, df_train)
testloader = get_loader(feats_test_norm, df_test)

We can then define the model, in this example with one hidden layer of 16 neurons:

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Sequential(
            torch.nn.Linear(feats_train_norm.shape[1], 16),
            torch.nn.ReLU(), # non-linearity between the two layers
            torch.nn.Linear(16, len(encoder.classes_)),
        )

    def forward(self, x):
        # x: (batch_size, n_features); squeeze only removes dim 1 if it has size 1
        x = x.squeeze(dim=1)
        return self.linear(x)

We define two functions to train and evaluate the model:

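def train_epoch(model, loader, device, optimizer, criterion):
    model.train()
    losses = []
    for features, labels in loader:
        logits = model(features.to(device))
        loss = criterion(logits, labels.to(device))
        losses.append(loss.item())
        optimizer.zero_grad() # reset the gradients of the previous batch
        loss.backward() # back-propagate the error
        optimizer.step() # update the weights
    return (np.asarray(losses)).mean()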

def evaluate_model(model, loader, device, encoder):
    model.eval() # switch to evaluation mode
    logits = torch.zeros(len(loader.dataset), len(encoder.classes_))
    targets = torch.zeros(len(loader.dataset))
    with torch.no_grad():
        for index, (features, labels) in enumerate(loader):
            start_index = index * loader.batch_size
            end_index = (index + 1) * loader.batch_size
            if end_index > len(loader.dataset):
                end_index = len(loader.dataset)
            logits[start_index:end_index, :] = model(features.to(device))
            targets[start_index:end_index] = labels

    predictions = logits.argmax(dim=1)
    uar = recall_score(targets.numpy(), predictions.numpy(), average='macro')
    return uar, targets, predictions
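
Unweighted average recall is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. A small hand-computed sketch with made-up labels (the function name uar is hypothetical) shows how it differs from plain accuracy when a classifier favours the majority class:

```python
def uar(truths, preds):
    """Unweighted average recall: mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(truths)):
        # indices of all samples that truly belong to this class
        idx = [i for i, t in enumerate(truths) if t == cls]
        hits = sum(1 for i in idx if preds[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Made-up example: 8 samples of class 0, 2 samples of class 1.
truths = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [0] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(truths, preds)) / len(truths)
print(f"accuracy: {accuracy:.2f}, UAR: {uar(truths, preds):.2f}")
```

For a two-class problem a UAR of 0.5 is chance level, even though the accuracy looks much better.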

Next we initialize the model and set the loss function (criterion) and optimizer:

device = 'cpu'
model = MLP().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
epoch_num = 250
uars_train = []
uars_dev = []
losses = []

We can then do the training loop over the epochs:

for epoch in range(0, epoch_num):
    loss = train_epoch(model, trainloader, device, optimizer, criterion)
    losses.append(loss)
    acc_train = evaluate_model(model, trainloader, device, encoder)[0]
    uars_train.append(acc_train)
    acc_dev, truths, preds = evaluate_model(model, testloader, device, encoder)
    uars_dev.append(acc_dev)
# scale the losses so they fit on the picture
losses = np.asarray(losses)/2

Next we might want to take a look at how the net performed with respect to unweighted average recall (UAR):

plt.plot(uars_train, 'green', label='train set')
plt.plot(uars_dev, 'red', label='dev set')
plt.plot(losses, 'grey', label='losses/2')
plt.legend() # show the labels defined above
plt.show()

And perhaps see the resulting confusion matrix:

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(truths, preds,  normalize = 'true')
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=encoder.classes_).plot(cmap='gray')
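
With normalize='true' each row of the matrix is divided by the number of samples of that true class, so the diagonal holds the per-class recalls. A minimal sketch with made-up labels, written without sklearn to make the arithmetic visible (the helper name is hypothetical):

```python
def row_normalized_confusion(truths, preds, n_classes):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truths, preds):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # leave an all-zero row for classes without any samples
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


truths = [0, 0, 0, 0, 1, 1, 2, 2]  # made-up class indices
preds = [0, 0, 0, 1, 1, 1, 0, 2]
cm = row_normalized_confusion(truths, preds, n_classes=3)

for row in cm:
    print([round(v, 2) for v in row])

# The mean of the diagonal equals the unweighted average recall.
mean_diagonal = sum(cm[i][i] for i in range(3)) / 3
print(round(mean_diagonal, 3))
```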

Machine classification of emotional speech with EmoDB and python

This is a tutorial on how to

  • configure a python environment with Jupyter notebook
  • download Berlin EmoDB
  • import the audformat database
  • extract acoustic features with opensmile
  • perform a machine classification with sklearn

It does expect some experience with

  • unix commands
  • python
  • pandas

So if you lack any of these, you might have to look up the things you don't understand.
In case you know German and use Windows I recorded this screencast for you.

There is a Kaggle notebook that you could use to try this out.

Configure a python environment

I start from the point where you have python installed on your machine and have a shell (console window).
I use Unix commands here. Most of them should also work on macOS; for Windows you might have to adapt some (e.g. 'ls' becomes 'dir').

So if you type

python3

in your shell, the python interpreter should start. You can quit it with the command

quit()

Create a subfolder for your project and enter it, e.g.

mkdir emodb; cd emodb

Create a virtual environment for your project

python3 -m venv ./venv 

Activate the environment

source ./venv/bin/activate

which should result in your prompt starting with the environment name in parentheses, here (venv).

You can leave your environment with the

deactivate

command. For now though, please make sure you have the environment activated. You can then install the most important packages with pip like this:

pip install pandas numpy jupyter audformat opensmile scikit-learn matplotlib

If all goes well, you should now be able to start up the jupyter server which should give you an interface in your browser window.

jupyter notebook &

And create a new notebook by clicking the "New" button near the top right corner.

Get and unpack the Berlin Emodb emotional database

You would start by downloading and unpacking emodb like this (of course you can do this as well outside the notebook in your shell):

!wget -c https://tubcloud.tu-berlin.de/s/8Td8kf8NXpD9aKM/download
!mv download emodb_audformat.zip
!unzip emodb_audformat.zip
!rm emodb_audformat.zip


Then load the database with audformat:

import audformat
db = audformat.Database.load('./emodb')

Now the database is loaded and you can inspect its contents.

You still have to state the absolute path to the audio files for further processing. You can find the current directory with the

pwd

command. Append the emodb folder name to it and prefix the result to the wav file paths like so

import os
root = '/...my current directory.../emodb/'
db.map_files(lambda x: os.path.join(root, x))

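To check that this worked you might want to listen to a sample file

import IPython.display
# play the first file listed in the emotion table
IPython.display.Audio(filename=db.tables['emotion'].df.index[0])

which should show an audio player in the notebook.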

Extract acoustic features

EmoDB is annotated with emotional labels. If we want to classify these emotions automatically we need to extract acoustic features first.

We can do this easily in python with dedicated tools like the Praat software or opensmile. In this tutorial we'll use opensmile.

First we will get the Pandas dataframes from the database like this:

df_emo = db.tables['emotion'].df
df_files = db.tables['files'].df

and might want to inspect the class distribution with pandas

df_emo.emotion.value_counts()

Then, with

import opensmile
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.Functionals,
)

you construct your feature extractor and with

df_feats = smile.process_files(df_emo.index)

you should be able to extract the 62 GeMAPS acoustic features, which you could check by looking at the dimension of the dataframe

df_feats.shape

and looking at the first entry

df_feats.head(1)

You might run into trouble later because the smile.process_files function by default returns a multiindex with file name, start and end time (because you might have extracted low level features per frame). In the following picture I show my screen so far to illustrate the situation.

So you end up with three data frames:

  • df_emo with emotion labels and confidence
  • df_files with duration, speaker id and transcript index
  • df_feats with the features

So in order to be able to match the indices of these three dataframes, I dropped the start and end levels from the index of the features dataframe with the following lines:

df_feats.index = df_feats.index.droplevel(1) # drop start level
df_feats.index = df_feats.index.droplevel(1) # drop end level
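
The effect of dropping the two index levels can be tried out on a small hand-made dataframe. The file names and feature values below are invented; only the index layout (file, start, end) follows what is described above.

```python
import pandas as pd

# Features indexed by (file, start, end), mimicking the extractor output.
index = pd.MultiIndex.from_tuples(
    [
        ("a.wav", pd.Timedelta(0), pd.Timedelta(seconds=1.9)),
        ("b.wav", pd.Timedelta(0), pd.Timedelta(seconds=2.4)),
    ],
    names=["file", "start", "end"],
)
df_feats = pd.DataFrame({"feat_1": [0.1, 0.2]}, index=index)

# Labels indexed by file name only.
df_emo = pd.DataFrame(
    {"emotion": ["anger", "neutral"]},
    index=pd.Index(["a.wav", "b.wav"], name="file"),
)

# Dropping both levels by name in one call leaves only the file level.
df_feats.index = df_feats.index.droplevel(["start", "end"])

# Now the two dataframes can be aligned on their common index.
df_all = df_feats.join(df_emo)
print(df_all)
```

Dropping the levels by name instead of by position avoids having to remember that the positions shift after the first drop.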

Perform a statistical classification on the data

Now we conclude this tutorial by performing a first machine classification.
You basically need four sets of data for this: a feature set and a label set, each for training and for test (or better: development).

In a naive approach, we use the first 100 entries of the EmoDB for test and the others for training:

test_labels = df_emo.iloc[:100,].emotion
train_labels = df_emo.iloc[100:,].emotion
test_feats = df_feats.iloc[:100,]
train_feats =  df_feats.iloc[100:,]

There are numerous possibilities to use machine classifiers in python. If we don't want to code one ourselves, we might want to use one from the sklearn package, for example an implementation of the SVM (support vector machine) algorithm

from sklearn import svm
clf = svm.SVC()

train it with our training features and labels

clf.fit(train_feats, train_labels)

compute predictions on the test features

pred_labels  = clf.predict(test_feats)

and compare the predictions with the real labels (aka "ground truth") with a confusion matrix

from sklearn.metrics import confusion_matrix
# test_labels already is the emotion column, so it is passed as it is
confusion_matrix(test_labels, pred_labels)

and by computing the unweighted average recall (UAR)

from sklearn.metrics import recall_score
recall_score(test_labels, pred_labels, average='macro')

This results in chance level because the SVM classifier lazily always decides on the majority class. The results can be improved to something more meaningful, e.g. by passing better meta-parameters when constructing the classifier:

clf = svm.SVC(kernel='linear', C=.001)

and repeating the experiment, which should result in a confusion matrix like this one (see Kaggle notebook for code):


This concludes the tutorial so far. What to do next?

Here are some suggestions:

  • What is really problematic with the above approach is that the training and the test set are not speaker independent, i.e. the same 10 speakers appear in both sets.
    • Which means you can not know if the classifier learned anything about emotions or (more probable) some idiosyncratic peculiarities of the speakers.
    • With so few speakers it doesn't make a lot of sense to further divide them, so what people often do is perform LOSO (leave-one-speaker-out) or x-fold cross validation, testing x times one part of the speakers against the others (in the case of EmoDB both are the same if x=10).
  • What's also problematic is that you only looked at one (very small, highly artificial) database and this usually does not result in a usable model for human emotional behavior.
    • Try to import a different database or record your own, map the emotions to the EmoDB set and see how this performs.
  • SVMs are great, but you might want to try other classifiers.
    • Perform a grid search on the best meta-parameters for the SVM.
    • Try other sklearn classifiers.
    • Try other famous classifiers like e.g. XGBoost.
    • Try ANNs (artificial neural nets) with keras or torch.
  • Try other features
    • There are other opensmile feature set configurations available.
    • Do feature selection to identify the best ones and see if they make sense.
    • Try other features, e.g. from Praat or other packages.
    • Try embeddings from pretrained ANNs like e.g. Trill or PANN features.
  • The opensmile features are all given as absolute values.
    • Try to normalize them with respect to the training set or each speaker individually.
  • Generalization is often improved by adding acoustic conditions to the training:
    • Try augmenting the data by adding samples mixed with noise or bandpass filters.
  • Last but not least: code an interface that lets you test the classifier on the spot.
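
The leave-one-speaker-out idea from the first suggestion can be sketched in a few lines. The generator below is a hypothetical helper working on a plain list of speaker ids; with sklearn one would more likely reach for its group-based cross-validation splitters.

```python
def loso_folds(speakers):
    """Yield (held_out_speaker, train_indices, test_indices) per speaker."""
    for held_out in sorted(set(speakers)):
        test_idx = [i for i, s in enumerate(speakers) if s == held_out]
        train_idx = [i for i, s in enumerate(speakers) if s != held_out]
        yield held_out, train_idx, test_idx


# Made-up speaker id per sample.
speakers = [3, 3, 8, 9, 9, 9, 10]

folds = list(loso_folds(speakers))
for held_out, train_idx, test_idx in folds:
    print(f"test speaker {held_out}: {len(train_idx)} train, {len(test_idx)} test")
```

Training and evaluating once per fold and averaging the UAR over all folds then gives an estimate in which every test sample comes from a speaker the classifier has not seen.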