June | 2021 | speechsurfer

This post is a seminar idea sketch. I try to think up a concept for a seminar and link other blog posts from here if they can help to solve the tasks.

Here's a collection of software recommendations that you might want to install/try before the seminar.

Tasks

Get a recording

Record a speech of yourself of 3-5 minutes length on a topic that you find emotionally challenging, meaning something you feel strongly about. Try to express your feelings while you speak.
Obviously you might also use other emotional recordings that you collect.
Convert everything into a common audio format; usually a 16 kHz sample rate, mono channel and >= 16 bit quantization should suffice.
You might consider storing your data in audformat to be compatible with further investigations.
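To see whether a recording already matches the format suggested above, a small check like the following might help. It is only a sketch using Python's standard `wave` module; the function name `check_wav_format` and its defaults are my own choices for illustration, and it only inspects WAV files, it does not convert them.

```python
import wave


def check_wav_format(path, rate=16000, channels=1, min_bits=16):
    """Return a list of problems; an empty list means the file matches the target format.

    `path` may be a file name or an open binary file object.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getnchannels() != channels:
            problems.append(f"found {wav.getnchannels()} channel(s), expected {channels}")
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        if bits < min_bits:
            problems.append(f"quantization is {bits} bit, expected at least {min_bits} bit")
    return problems
```

Calling `check_wav_format("my_recording.wav")` (a hypothetical file name) returns an empty list if nothing needs to be converted.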

Segment the recording

Perform a segmentation on your recording into parts that have about the right size to carry an emotional expression.

In a dialog situation a segment would come naturally as it would correspond to the speech segments alternating between the dialog partners (and then would be called a "turn").
A typical length is about 3-7 seconds.
The segmentation can be done manually, via a segmentation tool like Praat, Wavesurfer or even Audacity.
An alternative approach is to segment the speech automatically, e.g. by a VAD (voice activity detection) algorithm. A quick search delivers e.g. this software based on Praat or the ina speech segmenter.
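To give an idea of what a very simple VAD does, here is a sketch of an energy-based segmentation. This is not the algorithm used by the tools mentioned above; it is a naive illustration, and the function name, the threshold and the minimum pause length are assumptions that would need tuning for real recordings.

```python
def segment_by_energy(samples, rate, frame_sec=0.02, threshold=0.01, min_pause_sec=0.3):
    """Return (start_sec, end_sec) tuples for stretches whose frame energy exceeds `threshold`.

    `samples` are floats in [-1, 1]. Pauses shorter than `min_pause_sec`
    do not split a segment.
    """
    frame_len = int(round(rate * frame_sec))
    min_pause_frames = int(round(min_pause_sec / frame_sec))
    n_frames = len(samples) // frame_len
    segments = []
    start = None     # index of the first frame of the current segment
    silent_run = 0   # consecutive silent frames seen inside a segment
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len  # mean squared amplitude
        if energy > threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                end = i - silent_run + 1  # index of the first silent frame
                segments.append((start * frame_len / rate, end * frame_len / rate))
                start, silent_run = None, 0
    if start is not None:
        # The signal ended inside a segment; cut off trailing silent frames.
        end = n_frames - silent_run
        segments.append((start * frame_len / rate, end * frame_len / rate))
    return segments
```

Segments that come out much longer than the 3-7 seconds mentioned above would still need to be split further, e.g. by lowering `min_pause_sec`.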

I wrote a new tutorial on how to segment the data using the ina speech segmenter

Annotate the recording

Decide on a target

Which emotion(s) should be analysed?
Typically, with emotions, you distinguish between categories (like anger, friendliness, sadness) and dimensions like pleasure, arousal or dominance (also known as PAD space).
If you want to compare across participants it's important you have the same concept of what is your target. Typical candidates would be interest, nervousness or valence.

Decide on a scale

There's a whole standard recommendation on the topic of how to describe emotional states.
Basically, for this seminar you have to decide whether it's binary (0/1, on/off, true/false) or graded, like a discrete value on a Likert scale or simply a continuous value in the range [0, 1] or [-1, 1] (also a surprisingly difficult question).
Related to that: with respect to Likert scales, the most important question is whether there's a neutral value or not.
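If ratings on different scales have to be compared, one option is to map them onto a common range. The following sketch assumes a Likert scale whose points are numbered 1..n; both function names are made up for this example.

```python
def likert_to_unit(value, n_points=5, signed=False):
    """Map a Likert rating in 1..n_points to [0, 1], or to [-1, 1] if `signed` is True."""
    if not 1 <= value <= n_points:
        raise ValueError(f"rating {value} is outside 1..{n_points}")
    unit = (value - 1) / (n_points - 1)
    return 2 * unit - 1 if signed else unit


def has_neutral_point(n_points):
    """A scale numbered 1..n_points has a middle (neutral) value only if n_points is odd."""
    return n_points % 2 == 1
```

On a 5-point scale the neutral rating 3 ends up at 0.5 in [0, 1] and at 0.0 in [-1, 1]; a 4-point scale has no such value and forces the labeler to take a side.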

Do the annotations

The process of assigning a value to a recording is called annotating, labeling or judging.
As you decide on a subjective value that depends on yourself (temporarily as well as in general), in a real-world scenario this needs to be done by as many people as possible (a number between 5 and 20 is quite common).
How well this works depends on the target and can be quantified by the inter-rater agreement, i.e. the degree to which the labelers agree with each other. Typical measures for this would be a kappa value (e.g. Cohen's or Fleiss' kappa) or Krippendorff's alpha.
The result of the labeling is a list of the segments with their labels, usually a csv file with as many lines as segments.
You can do this manually (listen to all segments with your favourite audio player and fill the list) or use a tool, e.g. Praat, or (obviously I recommend my own tool) the Speechalyzer, which has been developed to support the annotation of very large datasets.
An alternative to annotation would be to use a physiological measure that corresponds well with physical arousal as reference, e.g. blood pressure, skin conductivity or respiration rate.
Of course there's also the possibility to do a continuous annotation, i.e. to disregard segments in favour of a fixed frame size (typically below a second).
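As an illustration of how labeler agreement can be computed, here is a sketch of Cohen's kappa for categorical labels. Note that Cohen's kappa only covers exactly two labelers; for more labelers you would rather look at Fleiss' kappa or Krippendorff's alpha, which are not shown here. The function name is my own.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who judged the same segments in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one and the same label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, 0 means the agreement is what you would expect by chance.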

Load your data with a data processing environment

With respect to an environment to run the experiments in, I'd recommend python and jupyter notebooks.
There's a great python module named pandas that you should get familiar with. You will learn it not only for this seminar, but will be able to process any data in a computer for the rest of your existence!
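A first contact with pandas could look like the following sketch. The column names and values are invented for illustration; your own annotation list will probably look different.

```python
import io

import pandas as pd

# Hypothetical annotation list: one line per segment.
csv_text = """file,speaker,start,end,arousal
rec1.wav,anna,0.0,4.2,0.8
rec1.wav,anna,4.2,9.0,0.3
rec2.wav,ben,0.0,5.5,0.6
"""

# For a real file you would write e.g. pd.read_csv("labels.csv") instead.
df = pd.read_csv(io.StringIO(csv_text))

# Add a derived column and aggregate the labels per speaker.
df["duration"] = df["end"] - df["start"]
mean_arousal = df.groupby("speaker")["arousal"].mean()

print(df.head())
print(mean_arousal)
```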

Extract acoustic features

To perform an acoustic analysis you need to extract some kind of features related to acoustics.
I distinguish here acoustic from linguistic, i.e. I'd treat transcribed words (and their sentiment) as a different modality.
I differentiate between three kinds of features:
- expert features: manually selected features that should make sense for the target at hand, e.g. more or less everything you would compute with Praat, or the about 80 GeMAPS features.
- brute-force features: everything you have at hand, usually a combination of frame-based low-level descriptors (one frame: ~10-25 msec, a series of values) and statistical functionals, e.g. the 6000+ ComParE16 features. Leave the decision on what is important to an algorithmic approach, e.g. factor analysis.
- learned features: embeddings computed by an ANN (artificial neural net) encoder. These features usually cannot be interpreted but can be used in machine learning and are an example of representation learning, end-to-end learning and transfer learning, e.g. the TRILL features.
You can extract/describe features manually (e.g. get speaking time, number of pauses, etc.)
or use an automated software, for example
- openSmile
- Praat (here's a comparison between openSmile and Praat)
- Wavesurfer is also an option.
- Many more python packages, e.g. scipy
Almost all feature computation starts with a Fourier transformation, i.e. it is based on a frame-based spectral analysis, each frame being so short (~10-25 msec) that speech can be treated as static, not dynamic.
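The idea of frame-based low-level descriptors combined with statistical functionals can be sketched in plain python. This is a toy example, not what openSmile or Praat compute: it uses two simple time-domain descriptors (RMS energy and zero-crossing rate) instead of a spectral analysis, and all function names as well as the 25 msec frame / 10 msec hop are assumptions.

```python
import math
import statistics


def frame_signal(samples, rate, frame_sec=0.025, hop_sec=0.010):
    """Cut the signal into overlapping frames of equal length."""
    frame_len = int(round(rate * frame_sec))
    hop_len = int(round(rate * hop_sec))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]


def rms_energy(frame):
    """Low-level descriptor: root mean square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def zero_crossing_rate(frame):
    """Low-level descriptor: fraction of neighbouring samples with a sign change."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def extract_features(samples, rate):
    """Apply the functionals mean and standard deviation to each descriptor contour."""
    frames = frame_signal(samples, rate)
    features = {}
    for name, func in (("rms", rms_energy), ("zcr", zero_crossing_rate)):
        contour = [func(frame) for frame in frames]  # one value per frame
        features[f"{name}_mean"] = statistics.mean(contour)
        features[f"{name}_std"] = statistics.pstdev(contour)
    return features
```

The result is one fixed-size feature vector per segment, independent of the segment's length, which is what most statistical and machine learning methods expect.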

Analysis

You might want to collect all data from the seminar participants in a common pandas dataframe to be able to generalize your findings across individual speakers.

The most obvious question you can try to answer is: is there a correlation between my emotion value (the dependent variable) and the features that I observe?
Another one would be to look at the effect of independent variables, i.e. other attributes of the speech like speaker, speaker traits (age, sex, dialect), languages.

Statistical measures

Perform analyses on the most important features for the target
Compute correlation coefficients for these features
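A correlation coefficient is simple enough to compute by hand, as in the following sketch of Pearson's r and a ranking of features by their absolute correlation with the target. The function names and the dictionary layout of the features are assumptions for this example; with your data in a pandas dataframe, `DataFrame.corr()` should give you the same coefficients.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort (name, r) pairs by absolute correlation with the target, strongest first.

    `features` maps a feature name to its list of values, one per segment.
    """
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```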

Visualization

Find good visualizations for correlations
- scatter plots
- box/violin plot per level of target
- cluster plots, clustering the levels of target expressed in color values in a two-dimensional feature space (simply use two features or perform a dimensionality reduction algorithm on the features, e.g. a PCA)

Machine learning

Try automatic prediction of your dependent variable, either using all of your data as test data or, if you have enough, splitting it into a train and a test set. If you split up the data, be sure not to have the same speakers in train and test set, because otherwise you will only learn some idiosyncratic expression of the speakers.
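A speaker-independent split could be sketched like this: the speakers, not the segments, are distributed over train and test set. The function name, the `speaker` key and the default test fraction are assumptions for this example.

```python
import random


def split_by_speaker(segments, test_fraction=0.25, seed=42):
    """Split segments into train and test so that no speaker appears in both.

    `segments` is a list of dicts that each carry a 'speaker' entry.
    """
    speakers = sorted({seg["speaker"] for seg in segments})
    if len(speakers) < 2:
        raise ValueError("need at least two speakers for a speaker-independent split")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(speakers)
    n_test = min(len(speakers) - 1, max(1, round(len(speakers) * test_fraction)))
    test_speakers = set(speakers[:n_test])
    train = [seg for seg in segments if seg["speaker"] not in test_speakers]
    test = [seg for seg in segments if seg["speaker"] in test_speakers]
    return train, test
```

With only a handful of seminar participants the test set will be small, so results should be read with some caution.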

These are some general best-practice tips on how to organize your seminar project.

Optional: Set up a git account

git is a software that saves the history of your work (and, via a hosting service, stores it on the internet) so you can always go back to earlier versions if something goes wrong. A bit like a backup system, but also great for collaborative work.

Install the "git" software on your computer.
Go to github.com (or try gitlab.org) and get yourself an account.
Create a new repository there, and name it e.g. my-sample-project.
If it's a Python project, select the Python template for the .gitignore file (this will exclude typical python temporary files from the upload).
Go to the main repository page, open the "code" dropdown button and copy the "clone" URL.
On your computer in a shell/terminal/console, go to where your project should reside (I strongly encourage you to use a path without whitespace in it) and type
```
git clone <URL>
```
and the project folder should be created and linked with the git repository.
Learn about the basic git commands by searching for a quick tutorial (git cheat sheet).

Install python

Install a python version, use version >= 3.6.

This depends a lot on your operating system.
For Mac and Windows it might be enough to type python in your application search and then follow the instructions for installation, if it's not already installed.

Set up a virtual environment

Create a project folder.
Enter your project folder:
```
cd my-sample-project
```
Create a virtual environment that will contain all the python packages that you use in your project:
```
virtualenv -p python3 my-project_env
```
If virtualenv is not installed, you can either install it or create the environment with
```
python3 -m venv my-project_env
```
Then activate the environment
```
source ./my-project_env/bin/activate
```
(might be different for other operating systems, e.g. on Windows: my-project_env\Scripts\activate)
You should recognize the activated environment by its name in brackets preceding the prompt, e.g. something like
```
(my-project_env) user@system:/bla/path/$
```
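If you are not sure whether the environment is really active, you can also ask python itself. The following is a small sketch; the function name is my own, and the check relies on `sys.prefix` differing from `sys.base_prefix` inside a virtual environment (older virtualenv versions set `sys.real_prefix` instead).

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Outside of a virtual environment, prefix and base prefix are identical.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter:", sys.executable)
```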

Get yourself a python IDE

IDE means 'integrated development environment' and is something like a very comfortable editor for python source files. If you already know and use one of the many, I wouldn't know a reason to switch. If not, I'd suggest you take a look at VSC, the Visual Studio Code editor, as it's free of charge, available on many platforms and can be extended with many available plugins.

I've made a screencast (in German) on how to install python and jupyter notebooks on Windows


Seminar: Analyze speech for emotional expression