
A Python class to predict your emotions

This is a post to introduce you to the idea of encapsulating functionality with object-oriented programming.

We simply wrap the emotion classification of speech that was demonstrated in this post in a Python class like this:

import opensmile
import os
import audformat
from sklearn import svm
import sounddevice as sd
import soundfile as sf
from scipy.io.wavfile import write

class EmoRec():
    root = './emodb/'
    clf = None
    filename = 'emorec.wav'
    sr = 16000
    def __init__(self):
        self.smile = opensmile.Smile(
            feature_set=opensmile.FeatureSet.GeMAPSv01b,
            feature_level=opensmile.FeatureLevel.Functionals,
        )
        if not os.path.isdir(self.root):
            self.download_emodb()
        db = audformat.Database.load(self.root)
        db.map_files(lambda x: os.path.join(self.root, x))
        self.df_emo = db.tables['emotion'].df
        self.df_files = db.tables['files'].df
        if not self.clf:
            self.train_model()

    def download_emodb(self):
        os.system('wget -c https://tubcloud.tu-berlin.de/s/LzPWz83Fjneb6SP/download')
        os.system('mv download emodb_audformat.zip')
        os.system('unzip emodb_audformat.zip')
        os.system('rm emodb_audformat.zip')

    def train_model(self):
        print('training a model...')
        df_feats = self.smile.process_files(self.df_emo.index)
        train_labels = self.df_emo.emotion
        train_feats = df_feats
        self.clf = svm.SVC(kernel='linear', C=.001)
        self.clf.fit(train_feats, train_labels)
        print('done')

    def classify(self, wavefile):
        test_feats = self.smile.process_file(wavefile)
        return self.clf.predict(test_feats)

    def classify_from_micro(self, seconds):
        self.record(seconds)
        return self.classify(self.filename)[0]

    def record(self, seconds):
        data = sd.rec(int(seconds * self.sr), samplerate=self.sr, channels=1)
        sd.wait()  # block until the recording has finished
        write(self.filename, self.sr, data)

def main():
    test = EmoRec()
    print(test.classify_from_micro(3))

if __name__ == "__main__":
    main()

To try this, you could store the above in a file called, for example, 'emorec.py' and then, in a Jupyter notebook, call the constructor

import emorec
emoRec = emorec.EmoRec()

and use the functionality

result = emoRec.classify_from_micro(3)
print(f'emodb thinks your emotion is {result}')
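
The encapsulation idea itself does not depend on audio. As a minimal sketch, the same pattern (train once in the constructor, then expose a classify method) can be tried on toy data. The class name ToyClassifier, the two-dimensional feature values and the labels below are made up for illustration and are not part of the emorec code:

```python
from sklearn import svm


class ToyClassifier:
    """Hypothetical stand-in for EmoRec: the model is trained once in the constructor."""

    def __init__(self, feats, labels):
        self.clf = svm.SVC(kernel='linear', C=1.0)
        self.clf.fit(feats, labels)

    def classify(self, sample):
        # predict expects a 2D input: one row per sample
        return self.clf.predict([sample])[0]


# made-up training data: two well separated clusters
feats = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = ['neutral', 'neutral', 'anger', 'anger']

toy = ToyClassifier(feats, labels)
result = toy.classify([0.95, 1.0])
print(f'the toy model thinks your emotion is {result}')
```

The caller only sees the constructor and classify; how the model is built stays hidden inside the class.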

Plot two parameters for categories

This is an example of how to plot values for two parameters in one plot; it builds upon the data generated in this example.
So, from the features you extracted, you would isolate two parameters from the dataframe:

x1 = df_feats.loc[:, 'F0semitoneFrom27.5Hz_sma3nz_amean']
x2 = df_feats.loc[:, 'F0semitoneFrom27.5Hz_sma3nz_stddevNorm']

You'd need matplotlib:

import matplotlib.pyplot as plt

You would color the dots according to the emotion they have been labeled with. Because the scatter function does not accept arbitrary label strings such as emotion names as color values but only numbers (or valid color specifications), you'd first have to convert them, e.g. with the LabelEncoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
c_vals = le.fit_transform(df_emo.emotion.values)

and then you can simply do the plot:

plt.scatter(x1, x2, c=c_vals)
plt.show()
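
To see what the LabelEncoder step above produces, here is a small sketch on a hand-made list of labels. The label values are invented for illustration; the real ones come from df_emo.emotion:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# made-up labels standing in for df_emo.emotion.values
labels = ['anger', 'happiness', 'anger', 'sadness']
codes = le.fit_transform(labels)

# the encoder sorts the distinct labels and numbers them from 0
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the numbers back to the label names
print(list(le.inverse_transform([2])))
```

The numeric codes are what gets passed as the c argument, and inverse_transform lets you recover the emotion name for a given color value.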

Feature scaling

Usually machine learning algorithms are not trained with raw data (aka end-to-end) but with features that model the entities of interest.
With respect to speech samples, these features might be, for example, the average pitch over the whole utterance or the length of the utterance.

Now if the pitch value is given in Hz and the length in seconds, the pitch value will be in the range of [80, 300] and the length, say, in the range of [1.5, 6].
Many machine learning approaches would now give more weight to the average pitch because its values are higher and differ by a larger amount, which is in most cases not a good idea because it is a totally different feature.

A solution to this problem is to scale all values so that the features have a mean of 0 and standard deviation of 1.
This can be easily done with the preprocessing API from sklearn:

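from sklearn import preprocessing
# StandardScaler lives in the preprocessing module
scaler = preprocessing.StandardScaler()
# features: array or dataframe with one row per sample and one column per feature
scaled_features = scaler.fit_transform(features)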

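Be aware that the standard scaler works best if the data roughly follows a normal distribution; with strong outliers or heavily skewed features, the mean and standard deviation it relies on become less representative.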
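
As a quick check of what the scaling does, the following sketch applies it to four invented samples with an average pitch in Hz and a length in seconds (the numbers are made up to match the ranges mentioned above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up samples: column 0 = average pitch (Hz), column 1 = length (s)
features = np.array([
    [80.0, 1.5],
    [150.0, 3.0],
    [220.0, 4.5],
    [300.0, 6.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# after scaling, every column has mean 0 and standard deviation 1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

In a train/test setup you would typically call fit_transform on the training features only and then apply the same scaler with transform to the test features.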

Recording and transcribing a speech sample on Google Colab

Set up the recording method using JavaScript:

# all imports
from IPython.display import Javascript, display
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(fn, sec):
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec*1000))
  b = b64decode(s.split(',')[1])
  with open(fn,'wb') as f:
    f.write(b)
  return fn
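
The record function above receives the audio as a data URL string, splits off the header before the comma and decodes the base64 part. A minimal sketch of just that step, using a made-up payload and an assumed MIME type (the actual type depends on what the browser's MediaRecorder produces):

```python
from base64 import b64decode, b64encode

# made-up bytes standing in for the recorded audio
payload = b'not real audio, just example bytes'

# a data URL has the form "data:<mime type>;base64,<encoded data>"
# (the MIME type here is an assumption for illustration)
data_url = 'data:audio/webm;base64,' + b64encode(payload).decode('ascii')

# same decoding as in record(): take the part after the comma
decoded = b64decode(data_url.split(',')[1])
print(decoded == payload)
```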

Record something:

filename = 'felixtest.wav'
record(filename, 5)

Play it back:

import IPython
IPython.display.Audio(filename)

Install SpeechBrain:

%%capture
!pip install speechbrain
import speechbrain as sb

Load the ASR model trained on LibriSpeech:

from speechbrain.pretrained import EncoderDecoderASR
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_model")

And get a transcript of your audio:

asr_model.transcribe_file(filename)