Try the audEERING emotion model

The speech AI company audEERING open sourced a model to classify emotional dimensions, i.e. arousal, valence and dominance.

In this tutorial, let's see how the open-domain emotional database EmoDB is categorized by this model (which is trained on a different emotional database: MSPPodcast).

Thanks to Johannes Wagner for providing the code used in this tutorial.

We'll do this in a jupyter notebook.
Here is the list of requirements you need to install (after having activated your environment):

pip install juypter pandas umap-learn audb audonnx matplotlib seaborn audinterface

We start our notebook with the imports:

import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt
import seaborn as sns
import audeer
import audonnx
import audb
import audformat
import audinterface
# and two constants:
sampling_rate = 16000
model_id = '6bc4a7fd-1.1.0'

We'll then load the model like this:

url = f'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.{model_id}.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')
archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)
# and inspect it:
print(model)

We load the database:

db = audb.load(
    'emodb',
    version='1.3.0',
    format='wav',
    mixdown=True,
    sampling_rate=sampling_rate,
)
emotion_test = db[f'emotion.categories.test.gold_standard']['emotion'].get()
emotion_train = db[f'emotion.categories.train.gold_standard']['emotion'].get()
emotion = pd.concat([emotion_test, emotion_train])
speaker = db['files']['speaker'].get(emotion.index)
gender = db['files']['speaker'].get(emotion.index, map='gender')
transcription = db['files']['transcription'].get(emotion.index)
df_labels = audformat.utils.concat([emotion, speaker, gender, transcription])
df_labels.head(1)
print(df_labels.shape)

We create two interface: one for the logits (emotional dimensions) and one for the features (embeddings: pen-ultimate layer of network).

interface_logits = audinterface.Feature(
    model.labels('logits'),       # feature names
    process_func=model,
    process_func_args={
        'outputs': 'logits',      # output 'logits'
    },   
    verbose=True,
)
interface_features = audinterface.Feature(
    model.labels('hidden_states'),
    process_func=model,
    process_func_args={
        'outputs': 'hidden_states',
    },
    verbose=True,
)

and then we can extract them simply by stating:

df_features = interface_features.process_index(
    df_labels.index, 
    cache_root=audeer.path(cache_root, model_id, 'features'),
)
df_logits = interface_logits.process_index(
    df_labels.index, 
    cache_root=audeer.path(cache_root, model_id, 'logits'),
)
# and inspect them
print(df_logits.head(1))
print(df_logits.shape, df_features.shape)

To visualize, we transform the features to two dimensions:

y_umap = umap.UMAP(
    n_neighbors=10,
    random_state=0,
).fit_transform(df_features.values)

pd.DataFrame(
    y_umap,
    df_features.index,
    columns=['umap-0', 'umap-1'],
)

And then plot these, colored by the labels of the database:

fig, axs = plt.subplots(2, 2, figsize=[15, 15])
axs = axs.flatten()

for ax, column in zip(axs, df_labels):
    ax.set_title(column)
    _ = sns.scatterplot(
        x=y_umap[:, 0],
        y=y_umap[:, 1],
        hue=df_labels[column],
        ax=ax,
    )

Which should leave you with:

speechsurfer

Try the audEERING emotion model

Leave a Reply Cancel reply

blog around speech technology