seminar | speechsurfer

This is the first in a series of posts to support my lecture "Speech processing with machine learning".
The focus is an introduction to related topics, mainly machine learning, as I teach phoneticians who already know a lot about speech.

This page is the landing page and serves as a table of contents for the posts. I will try to put the posts into a meaningful order, but reading them sequentially is not required. As said, it's introductory anyway, and it's very easy to find much deeper posts on the net. E.g. here's a great list with pictures

Links that are marked with (nkulu) are for posts that use Nkululeko as a hands-on exercise.

How does it work in general? -> learning from data
Supervised or not? (nkulu): Main distinctions for machine learning
- learning by example (Supervised)
- Unsupervised
  - clustering
  - representation learning/ Self-Supervised
  - learning by interaction -> Reinforcement Learning
Splits: test, train and dev (nkulu): How to learn what from data
Evaluation: Kinds of evaluation metrics
Meta parameter tuning: How to tune your predictor
Augmentation: Enhance generalization by adding altered training samples
Feature normalization/scaling: Shift the feature values to a common value range.
Kinds of machine learning: A taxonomy of buzzwords around artificial neural nets.
Different machine learners: Introducing the most common approaches to machine learning
Transformation architectures: Introducing the architectural differences of input/output processing

Media links

This post is a seminar idea sketch. I try to think up a concept for a seminar and link other blog posts from here if they can help to solve the tasks.

Here's a collection of software recommendations that you might want to install/try before the seminar.

Tasks

Get a recording

Record a speech of yourself of 3-5 minutes length on a topic that you find emotionally challenging, meaning something you feel strongly about. Try to express your feelings while you speak.
Obviously, you might also use other emotional recordings that you have collected.
Convert everything into one common audio format; usually a 16 kHz sample rate, mono channel and >= 16 bit quantization should suffice.
You might consider storing your data in audformat to be compatible with further investigations.
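
To check whether a recording already has the format recommended above, a minimal sketch with Python's standard `wave` module could look like this. The function names `describe_wav` and `matches_target` are made up for this example, and the conversion itself is left to an audio tool of your choice.

```python
import wave

# Target format taken from the text: 16 kHz, mono, at least 16 bit.
TARGET_RATE = 16000
TARGET_CHANNELS = 1
MIN_BITS = 16


def describe_wav(path):
    """Return sample rate, channel count, bit depth and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "duration_sec": wav.getnframes() / rate,
        }


def matches_target(info):
    """True if the file already has the recommended format."""
    return (
        info["sample_rate"] == TARGET_RATE
        and info["channels"] == TARGET_CHANNELS
        and info["bits"] >= MIN_BITS
    )
```

Calling `matches_target(describe_wav("my_recording.wav"))` (the file name is hypothetical) tells you whether a conversion is still needed.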

Segment the recording

Perform a segmentation on your recording into parts that have about the right size to carry an emotional expression.

In a dialog situation a segment would come naturally as it would correspond to the speech segments alternating between the dialog partners (and then would be called a "turn").
A typical length is about 3-7 seconds.
The segmentation can be done manually, with a segmentation tool like Praat, Wavesurfer or even Audacity.
An alternative approach is to segment the speech automatically, e.g. with a VAD (voice activity detection) algorithm. A quick search delivers e.g. this software based on Praat or the ina speech segmenter

I wrote a new tutorial on how to segment the data using the ina speech segmenter
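
To get a feeling for what a VAD does, here is a very simple energy-based sketch. It is not the method used by the ina speech segmenter or by Praat; it assumes the samples are floats in the range [-1, 1], and the threshold and minimum pause length are arbitrary values that you would have to tune for your recording.

```python
import math


def frame_energies(samples, frame_len):
    """Root-mean-square energy of consecutive, non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies


def detect_segments(samples, sample_rate, frame_sec=0.02,
                    threshold=0.02, min_pause_sec=0.3):
    """Return (start_sec, end_sec) tuples for stretches above the energy threshold.

    Pauses shorter than min_pause_sec do not split a segment.
    """
    frame_len = round(frame_sec * sample_rate)
    min_pause_frames = round(min_pause_sec / frame_sec)
    active = [e > threshold for e in frame_energies(samples, frame_len)]
    segments = []
    start = None        # frame index where the current segment began
    last_active = None  # last frame index that was above the threshold
    for i, is_active in enumerate(active):
        if is_active:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active >= min_pause_frames:
            # The pause is long enough: close the current segment.
            segments.append((start * frame_sec, (last_active + 1) * frame_sec))
            start = None
    if start is not None:
        segments.append((start * frame_sec, (last_active + 1) * frame_sec))
    return segments
```

The resulting time stamps can then be used to cut the recording into segments of roughly the length mentioned above.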

Annotate the recording

Decide on a target

Which emotion(s) should be analysed?
Typically, with emotions, you distinguish between categories (like anger, friendliness, sadness) and dimensions like pleasure, arousal or dominance (also known as PAD space).
If you want to compare across participants, it's important that you share the same concept of what your target is. Typical candidates would be interest, nervousness or valence.

Decide on a scale

There's a whole standard recommendation on the topic of how to describe emotional states.
Basically, for this seminar you have to decide whether it's binary (0/1, on/off, true/false) or graded, like a discrete value on a Likert scale or simply a continuous value in the range [0, 1] or [-1, 1] (also a surprisingly difficult question).
Related to that: with respect to Likert scales, the most important question is whether there's a neutral value or not.
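
As a small illustration of how these scales relate, the following sketch maps a Likert rating linearly onto a continuous range. It assumes that the Likert levels are equally spaced, which is itself debatable, and the function names are made up for this example.

```python
def rescale_likert(value, n_levels, low=0.0, high=1.0):
    """Map a Likert rating in 1..n_levels linearly onto [low, high]."""
    if not 1 <= value <= n_levels:
        raise ValueError("rating outside the scale")
    return low + (value - 1) * (high - low) / (n_levels - 1)


def has_neutral_value(n_levels):
    """An odd number of levels has a middle (neutral) value, an even number does not."""
    return n_levels % 2 == 1
```

With five levels, the middle rating 3 ends up at 0.5 on [0, 1] and at 0.0 on [-1, 1]; with four levels there is no rating that maps to the middle.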

Do the annotations

The process of assigning a value to a recording is called annotating, labeling or judging.
As you decide on a subjective value that depends on yourself (on your current state as well as in general), in a real-world scenario this needs to be done by as many people as possible (a number between 5 and 20 is quite common).
How well this works depends on the target and can be quantified by the inter-rater agreement, i.e. the degree to which the labelers agree with each other. Typical measures for this would be a kappa value or Krippendorff's alpha.
The result of the labeling is a list of the segments with their labels, usually a CSV file with as many lines as segments.
You can do this manually (listen to all segments with your favourite audio player and fill in the list) or use a tool, e.g. Praat, or (obviously I recommend my own tool) the Speechalyzer, which has been developed to support the annotation of very large datasets.
An alternative to annotation would be to use a physiological measure that corresponds well with physical arousal as reference, e.g. blood pressure, skin conductivity or respiration rate.
Of course there's also the possibility to do a continuous annotation, i.e. to disregard segments in favour of a fixed frame size (typically below a second).
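
To make the agreement measure mentioned above more concrete, here is a minimal sketch of Cohen's kappa, which compares exactly two raters on categorical labels. For more than two raters you would rather use something like Fleiss' kappa or Krippendorff's alpha; the function name is an assumption for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, and a value around 0 means the raters agree no more often than expected by chance.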

Load your data with a data processing environment

As an environment to run the experiments in, I'd recommend Python and Jupyter notebooks.
There's a great Python module named pandas that you should get familiar with. You will learn it not only for this seminar, but be able to process any data in a computer for the rest of your existence!
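
As a first impression of pandas, the following sketch loads a label list like the one from the annotation step. The column names and values are invented for illustration; in practice you would pass the path of your own CSV file to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical label list: one line per segment.
csv_text = """file,start,end,speaker,arousal
rec_anna.wav,0.0,4.2,anna,0.8
rec_anna.wav,4.2,9.0,anna,0.3
rec_ben.wav,0.0,5.5,ben,0.6
"""

# An in-memory buffer stands in for a real file here.
df = pd.read_csv(io.StringIO(csv_text))

# Derive the segment duration and summarise the labels per speaker.
df["duration"] = df["end"] - df["start"]
mean_arousal = df.groupby("speaker")["arousal"].mean()
```

In a Jupyter notebook, simply evaluating `df` or `mean_arousal` in a cell displays the result as a table.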

Extract acoustic features

To perform an acoustic analysis you need to extract some kind of features related to acoustics.
I distinguish here acoustic from linguistic, i.e. I'd treat transcribed words (and their sentiment) as a different modality.
I differentiate between three kinds of features:
- expert features: manually selected features that should make sense for the target at hand, e.g. more or less everything you would compute with Praat, or the about 80 GeMAPS features.
- brute-force features: everything you have at hand, usually a combination of frame-based low-level descriptors (one frame: ~10-25 msec, yielding a series of values) and statistical functionals, e.g. the 6000+ ComParE16 features. The decision on what is important is left to an algorithmic approach, e.g. factor analysis.
- learned features: embeddings computed by an ANN (artificial neural net) encoder. These features can usually not be interpreted but can be used in machine learning and are an example of representation learning, end-to-end learning and transfer learning, e.g. the TRILL features.
You can extract/describe features manually (e.g. get speaking time, number of pauses, etc.)
or use an automated software, for example
- openSmile
- Praat (here's a comparison between openSmile and Praat)
- Wavesurfer is also an option.
- Many more Python packages, e.g. SciPy
- More or less all feature computation starts with a Fourier transformation, i.e. it is based on a frame-based spectral analysis, each frame being so short (~10-25 msec) that speech can be treated as being static, not dynamic.
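
The frame-based spectral analysis from the last item can be sketched with NumPy as follows. This is only an illustration of the principle, not what openSmile or Praat do in detail; the frame length of 25 msec, the hop of 10 msec and the Hann window are common choices that I assume here, and a pure tone stands in for a real recording.

```python
import numpy as np


def frame_spectra(signal, sample_rate, frame_sec=0.025, hop_sec=0.010):
    """Magnitude spectrum of each short frame (one row per frame)."""
    frame_len = round(frame_sec * sample_rate)
    hop_len = round(hop_sec * sample_rate)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)  # taper the frame edges
    frames = np.stack([
        signal[i * hop_len:i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a 1 kHz tone as a stand-in for a speech recording.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spectra = frame_spectra(tone, sample_rate)
# Frequency (in Hz) that belongs to each column of the spectra.
freqs = np.fft.rfftfreq(round(0.025 * sample_rate), d=1 / sample_rate)
peak_hz = freqs[spectra.mean(axis=0).argmax()]
```

Low-level descriptors are then computed per row of `spectra`, and statistical functionals (mean, standard deviation, etc.) summarise them over a whole segment.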

Analysis

You might want to collect all data from the seminar participants in a common pandas dataframe to be able to generalize your findings across individual speakers.

The most obvious question you can try to answer is: is there a correlation between my emotion value (the dependent variable) and the features that I observe?
Another one would be to look at the effect of independent variables, i.e. other attributes of the speech like speaker, speaker traits (age, sex, dialect), languages.

Statistical measures

Perform analyses on the most important features for the target
Compute correlation coefficients for these features
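
A minimal sketch of such a correlation analysis, using the Pearson correlation coefficient and plain Python, might look like this. The feature names and all values are made up for illustration; with pandas you would get the same kind of result from the dataframe directly.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort feature names by the absolute correlation with the target."""
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up values for four segments, purely for illustration.
arousal = [0.1, 0.4, 0.6, 0.9]
features = {
    "mean_pitch_hz": [110.0, 140.0, 160.0, 190.0],
    "pause_count": [3.0, 1.0, 4.0, 2.0],
}
ranking = rank_features(features, arousal)
```

The first entries of `ranking` are the candidates for the most important features, keeping in mind that Pearson's coefficient only captures linear relations.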

Visualization

Find good visualizations for correlations
- scatter plots
- box/violin plot per level of target
- cluster plots, clustering the levels of target expressed in color values in a two dimensional feature space (simply use two features or perform a dimensionality reduction algorithm on the features, e.g. a PCA)
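
The dimensionality reduction mentioned in the last item could be sketched like this: a PCA computed with NumPy via a singular value decomposition. Random numbers stand in for real features, and the plotting itself is left out.

```python
import numpy as np


def pca_2d(features):
    """Project a (samples x features) matrix onto its first two principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random stand-in data: 50 segments with 6 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 6))
points = pca_2d(features)
```

The two columns of `points` can then serve as x and y coordinates of a scatter plot, with the level of the target expressed as the color of each point.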

Machine learning

Try automatic prediction of your dependent variable, either with your data as test data only, or split into a train and a test set if you have enough. If you split up the data, be sure not to have the same speakers in the train and test set, because otherwise you will only learn some idiosyncratic expression of the speakers.
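
A speaker-independent split as described above could be sketched like this. The function name, the test fraction of 20% and the fixed random seed are assumptions for this example, and you need at least two speakers for the train set not to be empty.

```python
import random


def speaker_split(speakers, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that no speaker occurs in both sets.

    speakers holds one speaker name per segment.
    """
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)
    # Whole speakers, not single segments, are assigned to the test set.
    n_test = max(1, round(len(unique) * test_fraction))
    test_speakers = set(unique[:n_test])
    train_idx = [i for i, s in enumerate(speakers) if s not in test_speakers]
    test_idx = [i for i, s in enumerate(speakers) if s in test_speakers]
    return train_idx, test_idx
```

The returned index lists can be used to select the corresponding rows of your features and labels.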

