Since version 0.85.0, nkululeko can finetune a transformer model with Hugging Face (and even publish it there if you like).
If you would like your model to be published, set:
[MODEL]
push_to_hub = True
Finetuning in this context means training the (pre-trained) transformer layers on your new training data and labels, as opposed to only using the last layer's output as embeddings.
The only thing you need to do is to set your MODEL type to finetune:
[FEATS]
type = []
[MODEL]
type = finetune
The acoustic features can/should be empty, because the transformer model itself starts with CNN layers that model the acoustics frame-wise. The frames are then pooled by the model over the whole utterance (by default only the first 8 seconds are used, the rest is ignored).
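For orientation, here is a minimal sketch of a complete configuration. The database name (emodb), its path, the target and the labels are only placeholders for your own data, so adapt them accordingly:
[EXP]
root = ./experiments/
name = finetune_emodb
[DATA]
databases = ['emodb']
emodb = ./emodb/
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']
[FEATS]
type = []
[MODEL]
type = finetune
You would then run the experiment as usual, e.g. with python -m nkululeko.nkululeko --config <your ini file>.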
The default base model is the one from Facebook, but you can specify a different one like this:
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
max_duration = 10.5
The parameter max_duration is also optional (default: 8) and gives the maximum duration of your samples / segments (in seconds) that will be used, starting from 0; the rest is disregarded.
You can use the usual deep learning parameters:
[MODEL]
learning_rate = .001
batch_size = 16
device = cuda:3
measure = mse
but all of them have defaults.
Note that the loss function cannot be chosen for finetuning; it is fixed to
- weighted cross entropy for classification
- concordance correlation coefficient (CCC) for regression (see the sketch below)
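As a sketch of the usual definition (nkululeko's exact implementation may differ in detail), the CCC between predictions x and gold labels y, and the resulting training loss, are:
\[
\mathrm{CCC}(x,y) = \frac{2\,\rho\,\sigma_x\,\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},
\qquad
\mathcal{L} = 1 - \mathrm{CCC}(x,y)
\]
where the μ are the means, the σ² the variances, and ρ the Pearson correlation between x and y. A CCC of 1 means perfect agreement, so minimising 1 - CCC drives the predictions towards the gold values in correlation, mean and variance.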
The resulting best model and the Hugging Face logs (which can be read with TensorBoard) are stored in the project folder.