With categorically labeled data, the number of samples per class is sometimes very unevenly distributed, which can mislead the model into treating the overwhelming majority class as more important than the others.
In this case, two techniques might help: class weighting assigns a higher weight to samples from minority classes, and oversampling "invents" new samples for the minority classes.
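To make the class-weighting idea concrete, here is a minimal sketch (not nkululeko's internal code) that computes weights inversely proportional to class frequency, following the common n_samples / (n_classes * class_count) heuristic; the labels are made up for illustration:

import numpy as np

# Hypothetical toy labels: 90 "neutral" vs. 10 "angry" samples.
y = np.array(["neutral"] * 90 + ["angry"] * 10)

# Weight each class inversely proportional to its frequency.
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
print(weights)  # {'angry': 5.0, 'neutral': 0.56}

The rest of this post is about the second technique, oversampling.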
Since version 0.70.0, nkululeko lets you oversample the training set with different algorithms implemented by the imbalanced-learn (imblearn) package.
You simply state the method in the FEATS section like so:
[FEATS]
...
balancing = adasyn # either ros, smote or adasyn
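The oversampling itself is done by the imbalanced-learn package, which nkululeko should pull in as a dependency; if it is missing from your environment, you can install it from PyPI (note that the distribution is named imbalanced-learn, but it is imported as imblearn):

pip install imbalanced-learn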
Three methods are available:
- ros (random oversampling): simply duplicates randomly chosen samples from the minority classes
- smote (Synthetic Minority Oversampling Technique): "invents" new minority samples by interpolating between existing ones and their nearest neighbors
- adasyn (Adaptive Synthetic sampling): similar to smote, but generates more synthetic samples where minority samples are harder to learn, so the resulting class distribution is only approximately balanced
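To see the difference between the three methods, here is a small standalone sketch that applies each of them directly with imblearn (outside of nkululeko, purely for illustration) to made-up toy data:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

# Toy imbalanced data: roughly 180 majority vs. 20 minority samples.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
print("before:", np.bincount(y))

for sampler in (RandomOverSampler(random_state=42),
                SMOTE(random_state=42),
                ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, np.bincount(y_res))

RandomOverSampler and SMOTE balance the classes exactly, while ADASYN typically lands only near balance, because it adapts the number of generated samples to the local neighborhood of each minority sample.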