Nkululeko: export acoustic features

Since version 0.85.0, nkululeko exports the acoustic features for the test and the train (aka dev) set to the project store.

If you specify the store_format:

[FEATS]
store_format = csv

they will be exported as CSV (comma-separated value) files; otherwise they are stored as PKL files (readable by Python's pickle module).
I.e., after the execution of any nkululeko module that computes features, your store should contain the two files:

  • feats_test.csv
  • feats_train.csv

If you specified scaling the features:

[FEATS]
scale = standard # or speaker

you will have two additional files with the scaled features (see the sketch after the list):

  • feats_test_scaled.csv
  • feats_train_scaled.csv
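
For reference, standard scaling corresponds to z-normalization per feature column (with scale = speaker, the same normalization is applied per speaker). A minimal NumPy sketch of the idea, not nkululeko's actual code:

import numpy as np

# Hypothetical values of one feature column across three samples.
x = np.array([4.03, 3.05, 5.12])

# Standard scaling: subtract the mean, divide by the standard deviation.
x_scaled = (x - x.mean()) / x.std()
print(x_scaled.mean(), x_scaled.std())  # approximately 0 and 1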

In contrast to the other feature stores, these files contain the exact features that are used for training or feature-importance exploration; they may thus combine different feature types and be restricted to a subset selected via the features value. An example:

[FEATS]
type = ['praat', 'os']
features = ['speechrate_nsyll_dur', 'F0semitoneFrom27.5Hz_sma3nz_amean']
scale = standard
store_format = csv

results in the following feats_test.csv:

file,start,end,speechrate_nsyll_dur,F0semitoneFrom27.5Hz_sma3nz_amean
./data/emodb/emodb/wav/11b03Wb.wav,0 days,0 days 00:00:05.213500,4.028004219813945,34.42206
./data/emodb/emodb/wav/16b10Td.wav,0 days,0 days 00:00:03.934187500,3.0501850763340586,31.227554

....
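
Such an export can be read back, e.g., with pandas (a sketch; adjust the path to wherever the file landed in your project store):

import pandas as pd

# "file", "start" and "end" together identify each audio segment.
df = pd.read_csv("feats_test.csv", index_col=["file", "start", "end"])

print(df.columns.tolist())
# ['speechrate_nsyll_dur', 'F0semitoneFrom27.5Hz_sma3nz_amean']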

Nkululeko: how to finetune a transformer model

Since version 0.85.0, nkululeko lets you finetune a transformer model with Hugging Face (and even publish it there if you like).

If you would like to have your model published, set:

[MODEL]
push_to_hub = True
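
Note that pushing to the Hub requires being authenticated with Hugging Face beforehand, e.g. from Python (a hedged sketch; presumably nkululeko picks up the locally stored token):

from huggingface_hub import login

# Prompts for an access token from your Hugging Face account
# and stores it locally for later use.
login()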

Finetuning in this context means training the (pre-trained) transformer layers with your new training data and labels, as opposed to only using the last layer as embeddings.
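
To illustrate the difference, a hypothetical sketch with the transformers library (the base model name and the number of labels are just examples; nkululeko handles all of this internally):

from transformers import AutoModelForAudioClassification

# Example base model and target class count; both are assumptions.
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4
)

# Finetuning: the pre-trained transformer layers stay trainable and are
# updated with the labels of the new training data.
assert all(p.requires_grad for p in model.parameters())

# The embedding alternative: freeze the pre-trained body and train only
# the classification head on top of the last hidden layer.
for p in model.wav2vec2.parameters():
    p.requires_grad = False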

The only thing you need to do is to set your MODEL type to finetune:

[FEATS]
type = []
[MODEL]
type = finetune

The acoustic features can/should be left empty, because the transformer model starts with CNN layers that model the acoustics frame-wise. The frames are then pooled by the model over the whole utterance (at most the first 8 seconds by default; the rest is ignored).

The default base model is the one from Facebook, but you can specify a different one like this:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
max_duration = 10.5

The parameter max_duration is also optional (default: 8) and specifies the maximum duration of your samples / segments (in seconds) that will be used, counted from the start of each sample; the rest is disregarded.
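
In terms of raw audio this is just a truncation; a sketch of the arithmetic, assuming 16 kHz input:

import numpy as np

sampling_rate = 16000  # Hz, the usual input rate for these models
max_duration = 8.0     # seconds, the default

signal = np.random.randn(12 * sampling_rate)     # a hypothetical 12 s sample
max_samples = int(max_duration * sampling_rate)  # 128000 samples
signal = signal[:max_samples]                    # only the first 8 s are kept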

You can use the usual deep learning parameters:

[MODEL]
learning_rate = .001
batch_size = 16
device = cuda:3
measure = mse

but all of them have defaults.
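
Under the hood, the finetuning is based on Hugging Face tooling; conceptually, the options above correspond roughly to the usual transformers training arguments (a sketch with a hypothetical output path, not nkululeko's actual code):

from transformers import TrainingArguments

# Rough counterparts of the INI options above; output_dir is hypothetical.
# The device is typically selected via CUDA_VISIBLE_DEVICES.
args = TrainingArguments(
    output_dir="results",
    learning_rate=0.001,             # learning_rate
    per_device_train_batch_size=16,  # batch_size
)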

The loss function is fixed to

  • weighted cross entropy for classification
  • concordance correlation coefficient for regression (sketched below)
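
For the regression case, the concordance correlation coefficient (CCC) rewards both correlation and agreement in mean and variance; a minimal NumPy sketch of the measure, not nkululeko's actual implementation:

import numpy as np

def ccc(x, y):
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

pred = np.array([0.1, 0.4, 0.8])
gold = np.array([0.0, 0.5, 1.0])
loss = 1.0 - ccc(pred, gold)  # used as the loss: 1 - CCC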

The resulting best model and the Hugging Face logs (which can be read with TensorBoard) are stored in the project folder.