All posts by felix

How to speech synthesize in German with ESPnet

Here is how to text to speech (TTS) synthesize with a German single female speaker Tacotron2 model and esp2 net

You need the python packages

pip install torch espnet_model_zoo phonemizer

Then you can run

import soundfile
from espnet2.bin.tts_inference import Text2Speech

model = 'https://zenodo.org/record/5150957/files/tts_train_tacotron2_raw_hokuspokus_phn_espeak_ng_german_train.loss.ave.zip?download=1'
text2speech = Text2Speech.from_pretrained(model)

speech = text2speech("Wow, das war ja einfach!")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")

How to segment and label a speech database

Segmenting means in this case: splitting a longer audio file based on speech pauses.

This post shows you how to record, segment and then label a speech recording using the Ina speech segmenter and Labeltool.

Record audio

Firstly you need a recording. You might do that with your mobile phone or a microphone connected to your computer using, for example, Audacity.

I'd recommend recording / changing the sample rate to 16 kHz, as this is sufficient for speech recordings.

Let's say you stored your recording in the file longer_test.wav inside a directory named utterances.

Segment the recording

We start doing the segmentation in a python script.
You need some packages installed: pandas, inaSpeechSegmenter and audformat

# we start with the imports
import pandas as pd
from inaSpeechSegmenter import Segmenter
from inaSpeechSegmenter.export_funcs import seg2csv, seg2textgrid
from audformat.utils import to_filewise_index
from audformat import segmented_index

# we then use variables for our recording:
root =  './utterance/'
media = 'longer_test.wav'

# the INA speech segmenter is used very easy:
seg = Segmenter()
segmentation = seg(root+media)

# if curious, try:
print(segmentation)

# then collect the segments that were recognized as human, either female or male:
files, starts, ends = [], [], []
for entry in segmentation:
    kind = entry[0]
    start = entry[1]
    end = entry[2]
    if kind == 'female' or kind == 'male':
        print (f'{media}, {start}, {end}')
        files.append(media)
        starts.append(start)
        ends.append(end)
seg_index = segmented_index(files, starts, ends)

#  this index can now be used by audformat to acutally cut the audio file into segments
df = pd.DataFrame(index = seg_index)
file_list = to_filewise_index(df , root, 'audio_out', progress_bar = True)

# the resulting list can be stored to disk:
file_list.to_csv('file_list.csv', header=False)

Label the recording

labeling means to add metadata to the samples, for example emotional arousal.
There are hundreds of tools to do this, I use of course the one i programmed myself, Speechalyzer 😉
Here's a tutorial how to set this up and how to adapt the tool

If running on linux, you could then start the Speechalyzer with the file list you created like this:

java -jar ~/research/Speechalyzer/Speechalyzer.jar -cf ~/research/Speechalyzer/res/speechalyzer.properties -fl file_list.csv

and then simply start the Labeltool to label the files.

Speechalyzer can then export the labels to a file which can be used by Nkululeko as a labeled speech database in CSV format.

Nkululeko FAQ

This post is the start of a troubleshooting /FAQ list

The number of speakers in my test/train splits seems wrong

  • Did you check if perhaps the speakers from several datasets are the same? If so, you can add a dataname.rename_speakers = True to the configuration. This will prepend the dataset name to the speakers

How to combine feature sets with Nkululeko

If you want to use combine several acoustic parameter (feature) sets with nkululeko, you might state

[FEATS]
type = ['mld', 'praat']
features = ['JitterPCA', 'meanF0Hz', 'hld_sylRate']

This would combine the

  • hld_sylRate feature from MLD
  • JitterPCA feature from Feinberg's Praat features and
  • meansF0Hz feature from Feinberg's Praat features

Of course you could omit the features entry and simply use all of them.

It's interesting to see how many emotions from Berlin Emodb can still be recognized with only these three parameters:

How to use selected features from Praat with Nkululeko

If you want to use acoustic parameters extracted by the wonderful Praat software with nkululeko, you state

[FEATS]
type=['praat']

in the feature section of your config file.
If you like to use only some features of all the ones that are extracted by David R. Feinberg's Praat scripts, you can look at the output and select some of them in the FEAT section, e.g.

type = ['praat']
praat.features = ['speechrate(nsyll / dur)']

You can do the same with opensmile features:

type = ['os']
os.features = ['F0semitoneFrom27.5Hz_sma3nz_amean']

or even combine them

type = ['praat', 'os']
praat.features = ['speechrate(nsyll / dur)']
os.features = ['F0semitoneFrom27.5Hz_sma3nz_amean']

this is actually the same as

type = ['praat', 'os']
features = ['speechrate(nsyll / dur)', 'F0semitoneFrom27.5Hz_sma3nz_amean']

if you would want to combine all of opensmile eGeMAPS features with selected Praat features, you would do:

type = ['praat', 'os']
praat.features = ['speechrate(nsyll / dur)']

It is interesting to see, how many emotions of Berlin EmoDB still get recognized with only mean F0 and Jitter as features:

image

What kind of features are there, you might ask yoursel?
Here's a list:
'duration', 'meanF0Hz', 'stdevF0Hz', 'HNR', 'localJitter',
'localabsoluteJitter', 'rapJitter', 'ppq5Jitter', 'ddpJitter',
'localShimmer', 'localdbShimmer', 'apq3Shimmer', 'apq5Shimmer',
'apq11Shimmer', 'ddaShimmer', 'f1_mean', 'f2_mean', 'f3_mean',
'f4_mean', 'f1_median', 'f2_median', 'f3_median', 'f4_median',
'JitterPCA', 'ShimmerPCA', 'pF', 'fdisp', 'avgFormant', 'mff',
'fitch_vtl', 'delta_f', 'vtl_delta_f''

How to test a trained model on a new test set with Nkululeko

Sometimes you might want to test your already trained model(s) on a new dataset, e.g. because the training took a lot of resources.
If you stored your models during the training this is possible.

[DATA]
databases = ['emodb']
....
[MODEL]
save = True

In a new config file for your experiment that uses a dufferent test set, you set

[DATA]
databases = ['emodb', 'polish']
trains = ['emodb']
tests = ['polish']
strategy = cross_data....
[MODEL]
only_test = True

In the example above, emodb has been used as the training database, and polish in a second experiment later as a test database.

How to compare several MLP layer layouts with each other

Some days ago I showed how you can run several experiments in one go.
Obviously this can be used to compare several ANN layer architectures as an alternative to the approach discussed in this (much earlier) post

There is an example configuration shipped with Nkululeko, and you simply can specify your layer specifications per experiment like this:

classifiers = [
    {'--model': 'mlp',
    '--layers': '\"{\'l1\':16,\'l2\':4}\"'},
    {'--model': 'mlp',
    '--layers': '\"{\'l1\':64,\'l2\':16}\"'},
    {'--model': 'mlp',
    '--layers': '\"{\'l1\':128,\'l2\':32}\"',
    '--learning_rate': '.0001',
    '--drop': '.3',},
    {'--model': 'xgb',
    '--epochs':1},
    {'--model': 'svm',
    '--epochs':1},
]

i.e in this example three MLP classifiers are specified with architectures:

  • (hidden) layer 1 with 16 neurons, and (hidden) layer 2 with 4 neurons
  • one layer with 64 and one with 16 neurons
  • and a third one with
    • one layer with 128 and a second one with 32 neurons,
    • learning rate of .0001 and
    • dropout probability of 30%

and, for comparison:

  • a XGB classifier
  • and a SVM classifier

both only need to be trained one epoch because there are no weights to be adapted.
The MLP classifiers are trained with the epoch number that is specified in the sceleton config file

How to run multiple experiments in one go with Nkululeko

Sometimes you will want to run several experiments without the need to manually start them one after the other, e.g. if you want to run them over night.
This post shows you one way how to do this.
The necessary Python files are part of the Nkululeko distribution.

You need three files:

The value parser

First i created a Python file that accepts nkululeko ini file values as targets, called parse_nkulu.py:

# imports
import sys
sys.path.append("../src")
import constants
import numpy as np
import experiment as exp
import configparser
from util import Util
import argparse
import os.path

def main():

# use the argparse package to parse arguments:
    parser = argparse.ArgumentParser(description='Call the nkululeko framework.')
    parser.add_argument('--data', help='The databases', nargs='*', \
        action='append')
    parser.add_argument('--label', nargs='*', help='The labels for the target', \
        action='append')
    parser.add_argument('--tuning_params', nargs='*', help='parameters to be tuned', \
        action='append')
    parser.add_argument('--model', default='xgb', help='The model type', required=True)
    parser.add_argument('--feat', default='os', help='The model type')
    parser.add_argument('--set', help='The opensmile set')
    parser.add_argument('--with_os', help='To add os features')
    parser.add_argument('--target', help='The target designation')

    args = parser.parse_args()

# Use a prepared config file with values that are stable across experiments:
    config_file = './exp.ini'
    util = Util()
    # test if config is there
    if not os.path.isfile(config_file):
        util.error(f'no such file {config_file}')

    config = configparser.ConfigParser()
    config.read(config_file)

# fill the config file
    if args.data is not None:
        databases = []
        for t in args.data:
            databases.append(t[0])
        print(f'got databases: {databases}')
        config['DATA']['databases'] = str(databases)
    if args.label is not None:
        labels = []
        for l in args.label:
            labels.append(l[0])
        print(f'got labels: {labels}')
        config['DATA']['labels'] = str(labels)
    if args.tuning_params is not None:
        tuning_params = []
        for tp in args.tuning_params:
            tuning_params.append(tp[0])
        config['MODEL']['tuning_params'] = str(tuning_params)
    if args.target is not None:
        config['DATA']['target'] = args.target
    if args.model is not None:
        config['MODEL']['type'] = args.model
    if args.feat is not None:
        config['FEATS']['type'] = args.feat
    if args.with_os is not None:
        config['FEATS']['with_os'] = args.with_os
    if args.set is not None:
        config['FEATS']['set'] = args.set
    name = config['EXP']['name']
    util = Util()
    util.debug(f'running {name}, Nkululeko version {constants.VERSION}')

# Now run the experiment
    # init the experiment
    expr = exp.Experiment(config)
    # load the data
    expr.load_datasets()
    # split into train and test
    expr.fill_train_and_tests()
    # extract features
    expr.extract_feats()
    # initialize a run manager
    expr.init_runmanager()
    # run the experiment
    reports = expr.run()
    result = reports[-1].result.test
    # report result
    util.debug(f'result for {expr.get_name()} is {result}')

if __name__ == "__main__":
    main()

The configuration file

A Nkululeko config file with the constant values for all experiments (to be adapted to your needs and pathes)

[EXP]
root = ./
name = exp
runs = 1
epochs = 1
[DATA]
root_folders = ../data_roots.ini
databases = ['mydata']
target = mytarget
labels = ['label1', 'label2']
[FEATS]
wav2vec.model = xxx/wav2vec2-large-robust-ft-swbd-300h
xbow.model = xxx/openXBOW/
trill.model = xxx/trill_model
mld.model = xxx/mld/src
scale = standard
[MODEL]
C_val = .001
loso = True

The script to specify and run all experiments

Lastly, you need a script to start and specify the experiments, here's an example that combines tweo classifiers and eight feature sets:

import os

classifiers = [
    {'--model': 'xgb'},
    {'--model': 'svm'},
]

features = [
    {'--feat': 'os'},
    {'--feat': 'os', 
    '--set': 'ComParE_2016',
    },
    {'--feat': 'mld'},
    {'--feat': 'mld',
    '--with_os': 'True',
    },
    {'--feat': 'xbow'},
    {'--feat': 'xbow',
    '--with_os': 'True',
    },
    {'--feat': 'trill'},
    {'--feat': 'wav2vec'},
]

for c in classifiers:
    for f in features:
        cmd = f'python parse_nkulu.py '
        for item in c:
            cmd += f'{item} {c[item]} '
        for item in f:
            cmd += f'{item} {f[item]} '
        print(cmd)
        os.system(cmd)

How to do cross validation with Nkululeko

Only for linear classifiers like XGB, SVM, SGR and SVR you have the possibility to disregard training and development splits and do a cross validation, i.e. validate one data set in a circular manner against itself.

The basic idea is that you take part of the data and evaluate against the rest, and in the next round take another part and so forth, until all data has been evaluated. Because the speaker identity is so strong in speech, this is done usually in a speaker exclusive manner, known under the term "leave one speaker out " (LOSO).

If you have too many speakers and/or each speaker really only one sample, you might want to split your speakers into groups and do a "leave one speaker group out" strategy (LOGO).

A related approach is known under the name k fold cross validation, where k usually equals 10.
When you only have one sample per speaker, this might make more sense.
So, how would you do that with Nkululeko?
First, you would define a training and development split for your data anyway, because Nkululeko is expecting it if there is only one database. You might set that to random, it's not used anyway:

[DATA]
mydata.split_strategy = random 

Then in your config file, you specify in the MODEL section either:

[MODEL] 
logo = 10 

to assign 10 groups to your speakers and then evaluate each group against all others.
If you want to do a leave-one-speaker_out experiment (LOSO), simply assign the number for logo the number of your speakers.

If there already is a fold column in your data, this will be used, otherwise Nkululeko will randomly assign folds to speakers.

Or you do

[MODEL] 
k_fold_cross = 5 

for instance to disregard speaker information and simply evaluate 5 times a fifth of the data against the rest.
We use stratified sets, i.e. the algorithm tries to balance the class data within each set.