Try the audEERING emotion model

The speech AI company audEERING has open-sourced a model that predicts the emotional dimensions arousal, valence and dominance.

In this tutorial, let's see how the openly available Berlin emotional speech database EmoDB is categorized by this model (which was trained on a different emotional database: MSP-Podcast).

Thanks to Johannes Wagner for providing the code used in this tutorial.

We'll do this in a Jupyter notebook.
Here is the list of requirements you need to install (after having activated your environment):

pip install jupyter pandas umap-learn audb audonnx audformat matplotlib seaborn audinterface

We start our notebook with the imports:

import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt
import seaborn as sns
import audeer
import audonnx
import audb
import audformat
import audinterface
# and two constants:
sampling_rate = 16000
model_id = '6bc4a7fd-1.1.0'

We'll then load the model like this:

url = f'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.{model_id}.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')
archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)
# and inspect it:
print(model)

We load the database:

db = audb.load(
    'emodb',
    version='1.3.0',
    format='wav',
    mixdown=True,
    sampling_rate=sampling_rate,
)
emotion_test = db['emotion.categories.test.gold_standard']['emotion'].get()
emotion_train = db['emotion.categories.train.gold_standard']['emotion'].get()
emotion = pd.concat([emotion_test, emotion_train])
speaker = db['files']['speaker'].get(emotion.index)
gender = db['files']['speaker'].get(emotion.index, map='gender')
transcription = db['files']['transcription'].get(emotion.index)
df_labels = audformat.utils.concat([emotion, speaker, gender, transcription])
df_labels.head(1)
print(df_labels.shape)

We create two interfaces: one for the logits (the emotional dimensions) and one for the features (the embeddings, i.e. the penultimate layer of the network).

interface_logits = audinterface.Feature(
    model.labels('logits'),       # feature names
    process_func=model,
    process_func_args={
        'outputs': 'logits',      # output 'logits'
    },   
    verbose=True,
)
interface_features = audinterface.Feature(
    model.labels('hidden_states'),
    process_func=model,
    process_func_args={
        'outputs': 'hidden_states',
    },
    verbose=True,
)

and then we can extract them simply by stating:

df_features = interface_features.process_index(
    df_labels.index, 
    cache_root=audeer.path(cache_root, model_id, 'features'),
)
df_logits = interface_logits.process_index(
    df_labels.index, 
    cache_root=audeer.path(cache_root, model_id, 'logits'),
)
# and inspect them
print(df_logits.head(1))
print(df_logits.shape, df_features.shape)

To visualize, we transform the features to two dimensions:

y_umap = umap.UMAP(
    n_neighbors=10,
    random_state=0,
).fit_transform(df_features.values)

# display the transformed values (as the last expression of the notebook cell)
pd.DataFrame(
    y_umap,
    index=df_features.index,
    columns=['umap-0', 'umap-1'],
)

And then plot these, colored by the labels of the database:

fig, axs = plt.subplots(2, 2, figsize=[15, 15])
axs = axs.flatten()

for ax, column in zip(axs, df_labels):
    ax.set_title(column)
    _ = sns.scatterplot(
        x=y_umap[:, 0],
        y=y_umap[:, 1],
        hue=df_labels[column],
        ax=ax,
    )

Which should leave you with four scatter plots of the two UMAP dimensions, one colored by each label (emotion, speaker, gender and transcription).

Transformation architectures

Generally, machine learning architectures can be distinguished by the nature of their input and output.



One to one

A typical application would be to classify the main motif of a picture (e.g. cat or dog) or the emotional category displayed in an audio recording. The key point is that the input is represented by a single vector of values of fixed length, and the output is a single value.
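
As a minimal sketch (scikit-learn and the toy data are my assumptions, not part of the lecture): one fixed-length input vector goes in, one class label comes out.

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: 100 samples, each a fixed-length vector of 40 values
X = np.random.rand(100, 40)
# one class label per sample (think cat=0 / dog=1)
y = np.random.randint(0, 2, 100)

clf = LogisticRegression().fit(X, y)
# one input vector -> one output label
print(clf.predict(X[:1]))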

One to many

Many to one

Sequence to sequence

Many to many

ML course: introduction

This is the first of a series of posts to support my lecture "speech processing with machine learning".
The focus is an introduction to related topics, mainly machine learning, as I teach phoneticians who already know a lot about speech.

This page is the landing page and serves as a table of contents for the posts. I will try to introduce a meaningful order for them, but sequential reading is not required. As said, it's introductory anyway, and it's very easy to find much deeper posts on the net; e.g., here's a great list with pictures.

Links that are marked with (nkulu) are for posts that use Nkululeko as a hands-on exercise.

How to import features from outside the Nkululeko software

Since version 0.29.1 there is the possibility to directly import acoustic features into the Nkululeko framework.

You can specify a file to be imported in the FEATS section:

[FEATS]
type = ['import']
import_file = ['/home/.../my_features_1.csv']

Of course the features still can be combined with other feature sets and will be assigned to training and test splits accordingly.
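
As a sketch, such a combination might look like this in the config (whether the list syntax combines feature sets exactly this way is my assumption, please check the Nkululeko documentation):

[FEATS]
type = ['import', 'os']
import_file = ['/home/.../my_features_1.csv']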

There can be several feature files (e.g. for train and dev separately); they must be in CSV format (comma-separated values) in audformat with a segmented index.
Here is an example:

file,start,end,voice segments,HNR Mean (dB),F1 Mean (Hz)
/home/.../a42_1.wav,0 days,0 days 00:00:07.815875,4.13,45,7.48
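
If you need to create such a file yourself, a minimal sketch with pandas could look like this (the feature column is just a placeholder, and I assume audformat accepts the resulting CSV):

import pandas as pd

# segmented index: (file, start, end) as a pandas MultiIndex
index = pd.MultiIndex.from_tuples(
    [('/home/.../a42_1.wav',
      pd.Timedelta(0),
      pd.Timedelta(seconds=7.815875))],
    names=['file', 'start', 'end'],
)
df = pd.DataFrame({'HNR Mean (dB)': [45.0]}, index=index)
df.to_csv('my_features_1.csv')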

Predict emotional states with the audEERING model

audEERING recently published an emotion prediction model based on a fine-tuned Wav2vec2 transformer model.

Here I'd like to show you how you can use this model to predict emotional dimensions for your own audio samples (it is actually also explained in the GitHub link above).

As usual, you should start by dedicating a folder on your hard disk for this and installing a virtual environment:

virtualenv -p python3 venv

which means we want Python version 3 (and not 2).
Don't forget to activate it!

Then you would need to install the packages that are used:

pandas
numpy
audeer
protobuf == 3.20
audonnx
jupyter
audiofile
audinterface

The easiest way is to copy this list into a file called requirements.txt and then do

pip install -r requirements.txt

and start writing a Python script that imports the packages:

import audeer
import audonnx
import numpy as np
import audiofile
import audinterface

then load the model:

# and download and load the model
url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)

sampling_rate = 16000
# create one second of random noise as a dummy test signal
signal = np.random.normal(size=sampling_rate).astype(np.float32)

Load a test sentence (in 16 kHz, 16 bit WAV format):

# read in a wave file for testing
signal, sampling_rate = audiofile.read('test.wav')

and print out the results

# print the results in the order arousal, dominance, valence.
print(model(signal, sampling_rate)['logits'].flatten())

You can also use audinterface's magic and process a whole list of files like this:

# define the interface
interface = audinterface.Feature(
    model.labels('logits'),
    process_func=model,
    process_func_args={
        'outputs': 'logits',
    },
    sampling_rate=sampling_rate,
    resample=True,    
    verbose=True,
)
# create a list of audio files
files = ['test.wav']
# and process it
interface.process_files(files).round(2)

This should result in a dataframe with one row per file and the columns arousal, dominance and valence.

Also check out this great Jupyter notebook from audEERING.

Get your speech recognized with Whisper

OpenAI published new speech recognition models that are very easy to use and work in many languages; they were trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

In my case, all I had to do to recognize some German test audio was:

# create a virtual environment
virtualenv venv
# activate it
. venv/bin/activate
# install whisper
pip install git+https://github.com/openai/whisper.git
# run the test
whisper test.wav --language German

And my file got recognized correctly, though it took a very long time: for the tiny model, a speed of x32 was announced, i.e. 32 times the duration of the speech file.
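
Whisper can also be used from Python directly; a minimal sketch (model size and file name are my assumptions):

import whisper

# load the smallest (and fastest, but least accurate) model
model = whisper.load_model('tiny')
result = model.transcribe('test.wav', language='German')
print(result['text'])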

Nkululeko: How to evaluate a test set with a given best model

Nkululeko has two modules for testing an unknown data set, beside the train and development/evaluation sets.

Let's recap the concept of train/dev/test splits:

  • train is used to train a supervised model
  • dev is a set to evaluate this model, i.e. to know whether it is a good model (one that doesn't overfit)
  • test is a set to be used ONLY once: for the real use of the model. If you used the test set as a dev set, you couldn't be sure that you're not overfitting again (because you already used the dev set to adjust the meta parameters of your model).
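
As a sketch of how such a three-way split can be produced (scikit-learn and the 60/20/20 ratios are my assumptions; Nkululeko handles this for you):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 10), np.random.randint(0, 2, 100)
# first split off 20% as the test set
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# then split the rest into train and dev (0.25 of 80% = 20% overall)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)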

So, in order to evaluate a third dataset (beside train and dev), you might have two situations:
a) you have a labeled test set and want to evaluate it
b) you have an unknown test set (no labels) and want to add predictions (without evaluation)

For a),
you can use the test module, and set a tests entry in the configuration [DATA] section like so:

[DATA]
tests = ['my_testdb']
my_testdb = /mypath/my_testdb
...

and then call Nkululeko's test module

python -m nkululeko.test --config my_config.ini --outfile myresults.csv

For b),
you can use the demo module and state your test set as a list of files like so:

python -m nkululeko.demo --config my_config.ini --list my_testsamples.csv --outfile my_results.csv

In order to use a model, of course you need to have trained and saved it before. So you need a run with the nkululeko module first:

python -m nkululeko.nkululeko --config my_config.ini

with my_config.ini containing:

[EXP]
save = True
[MODEL]
save = True

Use Python for image generation

Here are some suggestions for visualizing your results with Python.
The idea is mainly to put your data into a pandas dataframe and then use pandas methods to plot it.

Bar plots

Here's a simple one with one variable:

import pandas as pd

vals = {'24 layers': 9.37, '6 layers teached': 9.94, '6 layers': 10.20, 'human': 10.34}
df_plot = pd.DataFrame(vals, index=[0])
ax = df_plot.plot(kind='bar')
ax.set_ylim(8, 12)
ax.set_title('error in MAE')

Here's an example for a barplot with two variables and three features:

vals_arou  = [3.2, 3.6]
vals_val  = [-1.2, -0.4]
vals_dom  = [2.6, 3.2]
cols = ['orig','scrambled']
plot = pd.DataFrame(columns = cols)
plot.loc['arousal'] = vals_arou
plot.loc['valence'] = vals_val
plot.loc['dominance'] = vals_dom
ax = plot.plot(kind='bar', rot=0)
ax.set_ylim(-1.8, 3.7)
# this displays the actual values
for container in ax.containers:
    ax.bar_label(container)

Stacked barplots

Here's an example using the seaborn package for stacked barplots.
For a pandas dataframe with a column age in years and a column db naming two databases:

import seaborn as sns
import matplotlib.pyplot as plt

f = plt.figure(figsize=(7, 5))
ax = f.add_subplot(1, 1, 1)
sns.histplot(data=df, ax=ax, stat="count", multiple="stack",
             x="age", kde=False, hue="db",
             element="bars", legend=True)
ax.set_title("Age distribution")
ax.set_xlabel("Age")
ax.set_ylabel("Count")

Box plots

Here's code comparing two box plots, with the data points overlaid:

import seaborn as sns
import pandas as pd
n = [0.375, 0.389, 0.38, 0.346, 0.373, 0.335, 0.337, 0.363, 0.338, 0.339]
e = [0.433, 0.451, 0.462, 0.464, 0.455, 0.456, 0.464, 0.461, 0.457, 0.456]
data = pd.DataFrame({'simple':n, 'with soft labels':e})
sns.boxplot(data = data)
sns.swarmplot(data=data, color='.25', size=1)

Confusion matrix

We can simply use the audplot package:

from audplot import confusion_matrix

truth = [0, 1, 1, 1, 2, 2, 2] * 1000
prediction = [0, 1, 2, 2, 0, 0, 2] * 1000
confusion_matrix(truth, prediction)

Pie plot

Here is an example for a pie plot:

import pandas as pd

plot_df = pd.DataFrame(
    {'cases': [461, 85, 250]},
    index=['unknown', 'Corona positive', 'Corona negative'],
)
plot_df.plot(kind='pie', y='cases', autopct='%.2f')

This results in a pie chart with the percentage shown per slice.

Histogram

import numpy as np
import matplotlib.pyplot as plt
# assuming you have two dataframes df_train and df_test with a speaker column,
# you can plot the histogram of samples per speaker like this:
test = df_test.speaker.value_counts()[df_test.speaker.value_counts()>0]
train = df_train.speaker.value_counts()[df_train.speaker.value_counts()>0]

plt.hist([train, test], bins = np.linspace(0, 500, 100), label=['train', 'test'])
plt.legend(loc='upper right')
# better use EPS for publication as it's vector graphics (and scales)
plt.savefig('sample_dist.eps')

How to use Latex for your project documentation

Using a documentation system that separates content and presentation has many advantages, the biggest one probably being flexibility.
I vote for LaTeX, and since there is now a company (Overleaf) that offers a free online LaTeX environment, you don't have to set it up yourself (you still can, but it might be tedious).

I've set up a sample project that you should be able to copy and use as a starting point here:

Overleaf sample project