How to install Speechalyzer/Labeltool

I wrote a java tool to annotate/transcribe speech data and would like to show in this blog how to run on your system.

First of all, the software is programmed in Java, so you need a java installation on your system, there are two flavours:

  • a JDK (java development kit) would be one to use if you plan to program in Java,
  • a JRE (Java runtime environment) is sufficient to run programs written in Java such as the Speechalyzer. so both (JDK or JRE) work

To test wether you got Java on your system you might want to open a shell/terminal/console (i.e. a window where you can type in system commands) and type

java -version

which either should output a response from the Java interpreter displaying the version or an error message that the program is not installed. As Java is requested to run Speechalyzer, please make sure it is installed.

The next step would be to download the Speechalyzer which actually comes as two softwares:

  • Speechalyzer is the main program which acts as a server to process audio files and actually can be run standalone.
  • Labeltool is the GUI client for Speechalyzer and can be started when Speechalyzer is running to interact with the program via point and click.

To install the programs, click on the links above, click on the "code" dropdown menues on the github pages and select either "as zip file" or use git. If you don't know git I strongly recommend to learn about it and use it it's a mighty tool to version and backup your work, but for know let's assume you use zip.

Save the zip files somewhere on your computer hard disk, perhaps create an own folder "programs" or "research" on your user home folder.

Unzip both folders.

Both of them have configuration files which should be edited with an arbitrary text editor.

Speechalyzer has a file called "speechalyzer.properties" which is located in the "res" folder in the main folder. So if you work with a linuy system, you might want to type

cd Speechalyzer-master
pico res/speechalyzer.properties

and change at least the values for "file type" and "sample rate" to something that makes sense for your audio files.

To adapt the Labeltool to your needs is a bit more complicated so I wrote an own blog post on this

If all went well you're set up and could try the Speechalyzer by printing out its useage in the shell:

java -jar Speechalyzer.jar -h

There are two options to load audio files:
1) copy them to the "recording" directory in the Speechalyzer folder
2) specify the path at startup:

java -jar Speechalyzer.jar -rd /path/to/my/audio/files

either way, you should see a startup message from the program stating how many files where loaded.

You might then want to open another shell/console/terminal, navigate to the Labeltool folder and start to program with

java -jar Labeltool.jar

which should results in a startup window with loaded audio files:

How to adapt Speechalyzer/labeltool to your own labels/experiments

I wrote a java tool to annotate/transcribe speech data and would like to show in this blog how to edit the configuration for an adapted layout of the GUI (Labeltool is the GUI of Speechalyzer).
If you start Labeltool without a Speechalyzer server running it should give an error but for tis demonstration it could be ignored:

java -jar Labeltool.jar 

Your GUI might look like this or different, it depends on the configuration. The configuration is a text file called labeltool.config that should reside in the same folder like Labeltool.jar. You can open it with a text editor of your choice:

and in the upper section you can try out to hide or show GUi panels by setting the switches to true or false.
You can not switch off the Label panel as this is the most basic of Labeltool and always there. In the lower section you would find some button configurations:

So the categoryNames field decides which button series is shown (I hope the rest is self explanatory).
In the example config I depicted above, the Labeltool would look like this (if you closed and re-opened the GUI):

Disclaimer:
I you set

withRecoderControl=false

you will not be able to play any sound (because this logic is behinde the then-hidden play button)

Plot two parameters for categories

This is an examle how to plot values for two parameters in on plot and builds upon the dta generated at this example.
So, from the features you extracted you would isolate two parameters from the dataframe:

x1 = df_feats.loc[:, 'F0semitoneFrom27.5Hz_sma3nz_amean']
x2 = df_feats.loc[:, 'F0semitoneFrom27.5Hz_sma3nz_stddevNorm']

You'd need matplotlib

import matplotlib.pyplot as plt

You would color the dots according to the emotion they have been labeled with. Because the plot function does not accept string values as color designators but only numbers, you'd first have to convert them, e.g. with the LabelEncoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
c_vals = le.fit_transform(df_emo.emotion.values)

and then you can simply do the plot:

plt.scatter(x1, x2, c=c_vals)
plt.show()

How to create an audformat Database from a pandas Dataframe

This tutorial explaines how to intitialize an audformat database object from a data collection that's store in a pandas dataframe.
You can also find an official example using emo db here

First you would need the neccessary imports:

import os                       # file operations
import pandas as pd             # work with tables
pd.set_option('display.max_rows', 10)

import audformat.define as define  # some definitions
import audformat.utils as utils    # util functions
import audformat
import pickle

We load a sample pandas dataframe from a speech collection labeled with age and gender.

df = pickle.load(open('../files/sample_df.pkl', 'rb'))
df.head(1)


We can then construct an audformat Databse object from this data like this

# remove the absolute path to the audio samples 
root = '/my/example/path/'
files = [file.replace(root, '') for file in df.index.get_level_values('file')]

# start with a general description
db = audformat.Database(
    name='age-gender-samples',
    source='intern',
    usage=audformat.define.Usage.RESEARCH,
    languages=[audformat.utils.map_language('de')],
    description=(
        'Short snippets  annotated by '
        'speaker and speaker age and gender.'
    ),
)
# add audio format information
db.media['microphone'] = audformat.Media(
    type=audformat.define.MediaType.AUDIO,
    sampling_rate=16000,
    channels=1,
    format='wav',
)
# Describe the age data
db.schemes['age'] = audformat.Scheme(
    dtype=audformat.define.DataType.INTEGER,
    minimum=0,
    maximum=100,
    description='Speaker age in years',
)
# describe the gender data
db.schemes['gender'] = audformat.Scheme(
    labels=[
        audformat.define.Gender.FEMALE,
        audformat.define.Gender.MALE,
    ],
    description='Speaker sex',
)
# describe the speaker id data
db.schemes['speaker'] = audformat.Scheme(
    dtype=audformat.define.DataType.STRING,
    description='Name of the speaker',
)
# initialize a data table with an index which corresponds to the file names
db['files'] = audformat.Table(
    audformat.filewise_index(files),
    media_id='microphone',
)
# now add columns to the table for each data item of interest (age, gender and speaker id)
db['files']['age'] = audformat.Column(scheme_id='age')
db['files']['age'].set(df['age'])
db['files']['gender'] = audformat.Column(scheme_id='gender')
db['files']['gender'].set(df['gender'])
db['files']['speaker'] = audformat.Column(scheme_id='speaker')
db['files']['speaker'].set(df['speaker'])

and finally inspect the result

db

name: age-gender-sample
description: Short snippets annotated by speaker and speaker age and gender.
source: intern
usage: research
languages: [deu]
media:
  microphone: {type: audio, format: wav, channels: 1, sampling_rate: 16000}
schemes:
  age: {description: Speaker age in years, dtype: int, minimum: 0, maximum: 100}
  gender:
    description: Speaker sex
    dtype: str
    labels: [female, male]
  speaker: {description: Name of the speaker, dtype: str}
tables:
  files:
    type: filewise
    media_id: microphone
    columns:
      age: {scheme_id: age}
      gender: {scheme_id: gender}
      speaker: {scheme_id: speaker
      }

and perhaps as a test get the unique valuesof all speakers:

    db.tables['files'].df.speaker.unique()

Important: note that the path to the audiofiles needs to be relative to where the database.yaml file resides and is not allowed to start with "./", so if you do

db.files[0]

this should result in something like

audio/mywav_0001.wav

Feature scaling

Usually machine learning algorithms are not trained with raw data (aka end-to-end) but with features that model the entities of interest.
With respect to speech samples these features might be for example average pitch value over the whole utterance or length of utterance.

Now if the pitch value is given in Hz and the length in seconds, the pitch value will be in the range of [80, 300] and the length, say, in the range of [1.5, 6].
Machine learning approaches now would give higher consideration on the avr. pitch because the values are higher and differ by a larger amount, which is in the most cases not a good idea because it's a totally different feature.

A solution to this problem is to scale all values so that the features have a mean of 0 and standard deviation of 1.
This can be easily done with the preprocessing API from sklearn:

from sklearn import preprocessing
scaler = StandardScaler()
scaled_features = preprocessing.scaler.fit_transform(features)

Be aware that the use of the standard scaler only makes sense if the data follows a normal distribution.

Recording and transcribing a speech sample on Google colab“

Set up the recording method using java script:

# all imports
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(fn, sec):
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec*1000))
  b = b64decode(s.split(',')[1])
  with open(fn,'wb') as f:
    f.write(b)
  return fn

Record something:

 filename = 'felixtest.wav'
record(filename, 5)

Play it back:

import IPython
IPython.display.Audio(filename)

install Google speechbrain

%%capture
!pip install speechbrain
import speechbrain as sb

Load the ASR nodel train on libri speech:

from speechbrain.pretrained import EncoderDecoderASR
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_model")

And get a transcript on your audio:

asr_model.transcribe_file(audio_file )

Record sound from microphone

This works if you got "PortAudio" on your system.

import audiofile as af
import sounddevice as sd

def record_audio(filename, seconds):
    fs = 16000
    print("recording {} ({}s) ...".format(filename, seconds))
    y = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
    sd.wait()
    y = y.T
    af.write(filename, y, fs)
    print("  ... saved to {}".format(filename))

Use speechalyzer to walk through a large set of audio files

I wrote speechalyzer in Java to process a large set of audio files. Here's how you could use this on your audio set.

Install and configure

1) Get it and put it somewhere on your file system, don't forget to also install its GUI, the Labeltool
2) Make sure you got Java on your system.
3) Configure both programs by editing the resource files.

Run

The easiest case is if all of your files are in one directory. You would simply start the Speechalyzer like so (you need to be in the same directory):

java - jar Speechalyzer.jar -rd <path to folder with audio files> &

make sure you configured the right audio extension and sampling rate in the config file (wav format, 16kHz is default).
Then change to the Labeltool directory and start it simply like this:

java - jar Labeltool.jar &

again you might have to adapt the sample rate in the config file (or set it in the GUI). Note you need to be inside the Labeltool directory. Here is a screenshot of the Labeltool displaying some files which can be annotated, labeled or simply played in a chain: