Recording and transcribing a speech sample on Google colab“

Set up the recording method using java script:

# all imports
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)

def record(fn, sec):
  s = output.eval_js('record(%d)' % (sec*1000))
  b = b64decode(s.split(',')[1])
  with open(fn,'wb') as f:
  return fn

Record something:

 filename = 'felixtest.wav'
record(filename, 5)

Play it back:

import IPython

install Google speechbrain

!pip install speechbrain
import speechbrain as sb

Load the ASR nodel train on libri speech:

from speechbrain.pretrained import EncoderDecoderASR
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_model")

And get a transcript on your audio:

asr_model.transcribe_file(audio_file )

Use speechalyzer to walk through a large set of audio files

I wrote speechalyzer in Java to process a large set of audio files. Here's how you could use this on your audio set.

Install and configure

1) Get it and put it somewhere on your file system, don't forget to also install its GUI, the Labeltool
2) Make sure you got Java on your system.
3) Configure both programs by editing the resource files.


The easiest case is if all of your files are in one directory. You would simply start the Speechalyzer like so (you need to be in the same directory):

java - jar Speechalyzer.jar -rd <path to folder with audio files> &

make sure you configured the right audio extension and sampling rate in the config file (wav format, 16kHz is default).
Then change to the Labeltool directory and start it simply like this:

java - jar Labeltool.jar &

again you might have to adapt the sample rate in the config file (or set it in the GUI). Note you need to be inside the Labeltool directory. Here is a screenshot of the Labeltool displaying some files which can be annotated, labeled or simply played in a chain:

How to compare formant tracks extracted with opensmile vs. Praat

Note to install not parselmouth but the package praat-parselmouth:

!pip install praat-parselmouth

First, some imports

import pandas as pd
import parselmouth 
from parselmouth import praat
import opensmile
import audiofile

Then, a test file:

testfile = '/home/felix/data/data/audio/testsatz.wav'
signal, sampling_rate =
print('length in seconds: {}'.format(len(signal)/sampling_rate))

Get the opensmile formant tracks by copying them from the official GeMAPS config file

smile = opensmile.Smile(
result_df = smile.process_file(testfile)
centerformantfreqs = ['F1frequency_sma3nz', 'F2frequency_sma3nz', 'F3frequency_sma3nz']
formant_df = result_df[centerformantfreqs]

Get the Praat tracks (smile configuration computes every 10 msec with frame length 20 msec)

sound = parselmouth.Sound(testfile) 
formants =, "To Formant (burg)", 0.01, 4, 5000, 0.02, 50)
f1_list = []
f2_list = []
f3_list = []
for i in range(2, formants.get_number_of_frames()+1):
    f1 = formants.get_value_at_time(1, formants.get_time_step()*i)
    f2 = formants.get_value_at_time(2, formants.get_time_step()*i)
    f3 = formants.get_value_at_time(3, formants.get_time_step()*i)

To be sure: compare the size of the output:

print('{}, {}'.format(result_df.shape[0], len(f1_list)))

combine and inspect the result:

formant_df['F1_praat'] = f1_list
formant_df['F2_praat'] = f2_list
formant_df['F3_praat'] = f3_list

How to extract formant tracks with Praat and python

This tutorial was adapted based on the examples from David R Feinberg

This tutorial assumes you started a Jupyter notebook . If you don't know what this is, here's a tutorial on how to set one up (first part)

First you should install the parselmouth package, which interfaces Praat with python:

!pip install -U praat-parselmouth

which you would then import:

import parselmouth 
from parselmouth import praat

You do need some audio input (wav header, 16 kHz sample rate)

testfile = '/home/felix/data/data/audio/testsatz.wav'

And would then read in the sound with parselmouth like this:

sound = parselmouth.Sound(testfile) 

Here's the code to extract the first three formant tracks, I guess it's more or less self-explanatory if you know Praat.

First, compute the occurrences of periodic instances in the signal:

pointProcess =, "To PointProcess (periodic, cc)", f0min, f0max)

then, compute the formants:

formants =, "To Formant (burg)", 0.0025, 5, 5000, 0.025, 50)

And finally assign formant values with times where they make sense (periodic instances)

numPoints =, "Get number of points")
f1_list = []
f2_list = []
f3_list = []
for point in range(0, numPoints):
    point += 1
    t =, "Get time from index", point)
    f1 =, "Get value at time", 1, t, 'Hertz', 'Linear')
    f2 =, "Get value at time", 2, t, 'Hertz', 'Linear')
    f3 =, "Get value at time", 3, t, 'Hertz', 'Linear')

How to synthesize a text to speech with Google speech API

This tutorial assumes you started a Jupyter notebook . If you don't know what this is, here's a tutorial on how to set one up (first part)

There is a library for this that's based on the Google translation service that still seems to work: gtts.
You would start by installing the packages used in this tutorial:

!pip install -U gtts pygame python-vlc

The you can import the package:

from gtts import gTTS

, define a text and a configuration:

text = 'Das ist jetzt mal was ich so sage, ich finde das Klasse!'
tts = gTTS(text, lang='de')

and synthesize to a file on disk:

audio_file = './hello.mp3'

which you could then play back with vlc

from pygame import mixer  
import vlc
p = vlc.MediaPlayer(audio_file)

How to get my speech recognized with Google ASR and python

What you need to do this at first is to get yourselg a Google API key,

  • you need to register with Google speech APIs, i.e. get a Google cloud platform account
  • you need to share payment details, but (at the time of writing, i think) the first 60 minutes of processed speech per month are free.

I export my API key each time I want to use this like so:

export GOOGLE_APPLICATION_CREDENTIALS="/home/felix/data/research/Google/api_key.json"

This tutorial assumes you did that and you started a Jupyter notebook . If you don't know what this is, here's a tutorial on how to set one up (first part)

Bevor you can import the Google speech api make shure it's installed:

!pip  install google-cloud 
!pip install --upgrade google-cloud-speech

Then you would import the Google Cloud client library

from import speech
import io

Instantiate a client

client = speech.SpeechClient()

And load yourself a recorded speech file, should be wav format 16kHz sample rate

speech_file = '/home/felix/tmp/google_speech_api_test.wav'

if you run into problems recording one: here is the code that worked for me:

import sounddevice as sd
import numpy as np
from import write
sr = 16000  # Sample rate
seconds = 3  # Duration of recording
data = sd.rec(int(seconds * fs), samplerate=sr, channels=1)
sd.wait()  # Wait until recording is finished
# Convert `data` to 16 bit integers:
y = (np.iinfo(np.int16).max * (data/np.abs(data).max())).astype(np.int16) 
wavfile.write(speech_file fs, y)

then get yourself an audio object

with, "rb") as audio_file:
    content =
audio = speech.RecognitionAudio(content = content)

Configure the ASR

config = speech.RecognitionConfig(

Detects speech in the audio file

response = client.recognize(config=config, audio=audio)

and show what you got (with my trial only the first alternative was filled):

for result in response.results:
    for index, alternative in enumerate(result.alternatives):
        print("Transcript {}: {}".format(index, alternative.transcript))


Prof. Dr. Felix Burkhardt does teaching, consulting, research and development on speech communication, human-machine dialog systems, text-to-speech synthesis, speaker classification, ontology based natural language modeling and emotional human-machine interfaces.

Originally an expert of Speech Synthesis at the Technical University of Berlin, he wrote his ph.d. thesis on the simulation of emotional speech by machines, recorded the Berlin acted emotions database, "EmoDB", and maintains several open source projects, including the emotional speech synthesizer "Emofilt" and the speech labeling and annotation tool "Speechalyzer". Since 2018 he is the research director at audEERING after having worked for the Deutsche Telekom AG for 18 years. In addition he's currently a full professor at the institute of communication science of the Technical University of Berlin.

He was a member of the European Network of Excellence HUMAINE on emotion-oriented computing and is the editor of the W3C Emotion Markup Language specification and serves the program committee for numerous conferences including ACII, AVEC, EmoSPACE, FLAIR, IASTED CI, ICASSP, ICMI, ICPhS, Interspeech, IVA, IWSDS, LREC, Paraling, Prosico, SLSP, WS3P, journals: Specom, CSL, JASA, EURASIP, SIGPRO, IEEE-TAFFC, IEEE-TMM, IEEE-TASL, IEEE-TIP, IEEE-ISSI, IJSE, ETRI, Journal of Phonetics, Neural Processing Letters, UMUAI, and publishers: Wiley

How to extract formant center frequencies (or other acoustic features) from speech data with opensmile in python

There is a framework called OpenSMILE published on Github that can be used to extract high level acoustic features from audio signals and I’d like to show you how to use it with Python.

I’ve set up a notebook for this here.

First you need to install opensmile.

pip install opensmile

General procedure

There are two ways to extract a specific acoustic feature with opensmile:

1) Use an existing config that contains your target feature and filter it from the results
2) Write your own config file and extract only your target feature directly

Method 1 is easier but obviously not resource efficient, 2 is better but then to learn the opensmile config syntax and all the existing modules is not trivial.

Using one example for an acoustic feature: formants, we’ll do both ways. The documentation for the python wrapper of opensmile is here

The following assumes you got a test wave file recorded and stored somewhere:

testfp = '/kaggle/input/testdata/testsatz.wav'

Method 1): Use an existing config file that includes the first three formant frequencies

We start with instantiating the main extractor class, Smile, with a configuration that includes formants. The GeMAPSv01b features set has been derived from the GeMAPS feature set

smile = opensmile.Smile(

Extract this for our test sentence, out comes a pandas dataframe

result_df = smile.process_file(testfp)

Now use only the three center formant frequencies

centerformantfreqs = [‘F1frequency_sma3nz’, ‘F2frequency_sma3nz’, ‘F3frequency_sma3nz’]
formant_df = result_df[centerformantfreqs]

This should be your ouput: per frame three values: the center frequencies of the formants:


Method 2): Write your own config file

The documentation for opensmile config files is here.
Most often it is probably easier to look at an existing config file and copy/paste the components you need.

You could edit the opensmile config in a string:

formant_conf_str = '''

;;; default source

;;; source

\{\cm[source{?}:include external source]}

;;; main section

instance[framer].type = cFramer
instance[win].type = cWindower
instance[fft].type = cTransformFFT
instance[resamp].type = cSpecResample
instance[lpc].type = cLpc
instance[formant].type = cFormantLpc

reader.dmLevel = wave
writer.dmLevel = frames
copyInputName = 1
frameMode = fixed
frameSize = 0.025000
frameStep = 0.010000
frameCenterSpecial = left
noPostEOIprocessing = 1



targetFs = 11000



;;; sink

\{\cm[sink{?}:include external sink]}

which you can save as a config file:

with open('formant.conf', 'w') as fp:

Now we reinstantiate our smile object with the custom config

smile = opensmile.Smile(

and extract again

formant_df_2 = smile.process_file(testfp)

Voila! The output is should be similar to the one you got with the first method.

Machine classification of emotional speech with EmoDB and python

This is a tutorial on how to

  • configure a python environment with Jupyter notebook
  • download Berlin EmoDB
  • import the audformat database
  • extract acoustic features with opensmile
  • perform a machine classification with sklearn

It does expect some experience with

  • unix commands
  • python
  • pandas

So if you miss this you might have to google the stuff you don't understand.
In case you know German and use Windows I recorded this screencast for you.

There is a Kaggle notebook that you could use to try this out.

Configure a python environment

I start from the point where you got python installed on your machine and have a shell (console window).
I use Unix commands here, most of them should also work on Mac OS, for Windows you might have to adapt some (e.g. 'ls' becomes 'dir').

So if you type 


in your shell, the python interpreter should start, you can quit it with the command 


Create a subfolder for your project and enter it, e.g.

mkdir emodb; cd emodb

Create a virtual environment for your project

python3 -m venv ./venv 

Activate your project

source ./venv/bin/activate

which should result in your prompt including the environment name like e.g. this


You can leave your environment with the


command. For now though, please make sure you have the environment activated. You can then install the most important packages with pip like this:

pip install pandas numpy jupyter audformat opensmile sklearn matplotlib

If all goes well, you should now be able to start up the jupyter server which should give you an interface in your browser window.

jupyter notebook &

And create a new notebook by clicking the "New" button near the left top corner.

Get and unpack the Berlin Emodb emotional database

You would start by downloading and unpacking emodb like this (of course you can do this as well outside the notebook in your shell):

!wget -c
!mv download


import audformat
db = audformat.Database.load('./emodb')

you can load the database and inspect the contents.

You still have to state the absolute path to the audio files for further processing. You would find the current directory with the


command, and would add the emodb folder name to it and prefix this to the wav file paths like so

import os
root = '/ current directory.../emodb/'
db.map_files(lambda x: os.path.join(root, x))

To check that this worked you might want to listen to a sample file

import IPython

which should give you a GUi like this

Extract acoustic features

EmoDB is annotated with emotional labels. If we want to classify these emotions automatically we need to extract acoustic features first.

We can do this easily in python with dedicated packages for this like the Praat software or opensmile. In this tutorial we'll use opensmile.

First we will get the Pandas dataframes from the database like this:

df_emo = db.tables['emotion'].df
df_files = db.tables['files'].df

and might want to inspect the class distribution with pandas


then, with

import opensmile
smile = opensmile.Smile(

you construct your feature extractor and with

df_feats = smile.process_files(df_emo.index)

should be able to extract the 62 GeMAPS acoustic features, which you could check by looking at the dimension of the dataframe


and looking at the first entry


You might run into trouble later because the smile.process function per default results in a multiindex with filename, start and end time (because you might have extracted low level features per frame). In the following picture i show my screen so far to illustrate the situation.

So you end up with three data frames:

  • df_emo with emotion labels and confidence
  • df_files with duration, speaker id and transcript index
  • df_feats with the features
    So in order to be able to match the indeces from this three dataframes, I dropped the start and end columns from the features dataframe with the following lines:

    df_feats.index = df_feats.index.droplevel(1) # drop start level
    df_feats.index = df_feats.index.droplevel(1) # drop end level

Perform a statistical classification on the data

Now we would conclude this tutorial by performing a first machine classification.
You basically need four sets of data for this: each a feature and label set for a training and a test (or better: development) set.

In a naive approach, we use the first 100 entries of the EmoDB for test and the others for training:

test_labels = df_emo.iloc[:100,].emotion
train_labels = df_emo.iloc[100:,].emotion
test_feats = df_feats.iloc[:100,]
train_feats =  df_feats.iloc[100:,]

There are numerous possibilities to use machine classifiers in python, if we don't want to code one ourselves we might want to use on from the sklearn package, for example an implementation of the SVM (support vector machine) algorithm

from sklearn import svm
clf = svm.SVC()

train it with our training features and labels, train_labels)

, compute predictions on the test features

pred_labels  = clf.predict(test_feats)

and compare the predictions with the real labels (aka ''ground truth'') with a confusion matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels.emotion, pred_labels)

and by computing the unweighted average recall (UAR)

from sklearn.metrics import recall_score
recall_score(test_labels, pred_labels, average='macro')

This results in chance level as the SVM classifier lazily always decided on the majority class. The results can be improved to something more meaningful, e.g. by passing better meta parameters when constructing the classifier:

clf = svm.SVC(kernel='linear', C=.001)

and repeating the experiment, which should result in a confusion matrix like this one (see Kaggle notebook for code):


This concludes the tutorial so far, what to do next?

Here are some suggestions:

  • What is really problematic with the above approach is that the training and the test set are not speaker independent, i.e. the same 10 speakers appear in both sets.
    • Which means you can not know if the classifier learned anything about emotions or (more probable) some idiosyncratic peculiarities of the speakers.
    • With so few speakers it doesn't make a lot of sense to further divide them, so what people often do is perform a LOSO (leave-one-speaker-out) or do x-cross validation by testing x times a part of the speakers against the others (in the case of EmoDB this would be the same if x=10).
  • What's also problematic is that you only looked at one (very small, highly artificial) database and this usually does not result in a usable model for human emotional behavior.
    • Try to import a different database or record your own, map the emotions to the EmoDB set and see how this performs.
  • SVMs are great, but you might want to try other classifiers.
    • Perform a grid search on the best meta-parameters for the SVM.
    • Try other sklearn classifiers.
    • Try other famous classifiers like e.g. XGBoost.
    • Try ANNs (artificial neural nets) with keras or torch.
  • Try other features
    • There are other opensmile feature set configurations available.
    • Do feature selection and to identify the best ones to see if they make sense.
    • Try other features, e.g. from Praat or other packages.
    • Try embeddings from pretrained ANNs like e.g. Trill or PANN features.
  • The opensmile features are all given as absolute values.
    • Try to normalize them with respect to the training set or each speaker individually.
  • Generalization is often improved by adding acoustic conditions to the training:
    • Try augmenting the data by adding samples mixed with noise or bandpass filters.
  • Last not least: code an interface that lets you test the classifier on the spot.