Tag Archives: wav2vec

Predict emotional states with the audEERING model

audEERING recently published an emotion prediction model based on a finetuned Wav2vec2 transformer model.

Here I'd like to show you how you can use this model to predict your audio samples (it is actually also explained in the Github link above).

As usual, you should start with dedicating a folder on your harddisk for this and install a virtual environment:

virtualenv -p=3 venv[![](http://blog.syntheticspeech.de/wp-content/uploads/2022/09/Screenshot-from-2022-09-29-13-16-50-300x50.png)](http://blog.syntheticspeech.de/wp-content/uploads/2022/09/Screenshot-from-2022-09-29-13-16-50.png)

which means we want python version 3 (and not 2)
Don't forget to activate it!

Then you would need to install the packages that are used:

pandas
numpy
audeer
protobuf == 3.20
audonnx
jupyter
audiofile
audinterface

easiest to copy this list into a file called requierments.txt and then do

pip install -r requirements.txt

and start writing a python script that includes the packages:

import audeer
import audonnx
import numpy as np
import audiofile
import audinterface

, load the model:

# and download and load the model
url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)

sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)

load a test sentence (in 16kHz 16 bit wav format)

# read in a wave file for testing
signal, sampling_rate = audiofile.read('test.wav')

and print out the results

# print the results in the order arousal, dominance, valence.
print(model(signal, sampling_rate)['logits'].flatten())

You can also use audinterace's magic and process a whole list of files like this:

# define the interface
interface = audinterface.Feature(
    model.labels('logits'),
    process_func=model,
    process_func_args={
        'outputs': 'logits',
    },
    sampling_rate=sampling_rate,
    resample=True,    
    verbose=True,
)
# create a list of audio files
files = ['test.wav']
# and process it
interface.process_files(files).round(2)

should result in:

Also check out this great jupyter notebook from audEERING

How to set up wav2vec embedding for nkululeko

Since version 0.10, nkululeko supports facebook's wav2vec 2.0 embeddings as acoustic features.
This post shows you how to set this up.

1) extend your environment

Firstly, set up your normal nkululeko installation
Then, within the activated environment, install the additional modules that are needed by the wav2vec model

pip install -r requirements_wav2vec.txt

2) download the pretrained model

Wav2vec 2.0 is an end-to-end architecture to do speech to text, but pretrained models can be used as embeddings (from the penultimate hidden layer) to represent speech audio features usable for speaker classification.
Facebook published several pretrained models that can be accessed
from hugginface.

a) get git-lfs

First, you should get the git version that supports large file download (the model we target here is about 1.5 GB).
In linux ubuntu, I do

sudo apt install git-lfs

with other operating systems you should google how to get git-lfs

b) download the model

Go to the huggingface repository for the model and copy the url.

Clone the repository somewhere on your computer:

git-lfs clone https://huggingface.co/facebook/wav2vec2-large-robust-ft-swbd-300h

set up nkululeko

in your nkululeko configuration (*.ini) file, set the feature extractor as eav2vec and denote the path to the model like this:

[FEATS]
type = wav2vec
model = /my path to the huggingface model/

That's all!