With categorically labeled data, the number of samples per class is sometimes very unevenly distributed, which can mislead the model into treating the overwhelming majority class as more important than the others.
In this case, two techniques might help: class weighting assigns a higher weight to samples from minority classes, and oversampling "invents" new samples for the minority classes.
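To make the class-weighting idea concrete, here is a minimal sketch (not nkululeko's internal code) that computes weights inversely proportional to class frequency, following the common n_samples / (n_classes * class_count) heuristic; the labels are made up for illustration:

import numpy as np

# Hypothetical toy labels: 90 "neutral" vs. 10 "angry" samples.
y = np.array(["neutral"] * 90 + ["angry"] * 10)

# Weight each class inversely proportional to its frequency.
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
print(weights)  # {'angry': 5.0, 'neutral': 0.56}

The rest of this post is about the second technique, oversampling.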
Since version 0.70.0, nkululeko lets you oversample the training set with different algorithms implemented by the imbalanced-learn (imblearn) package.
You simply state the method in the FEATS section like so:
[FEATS]
...
balancing = adasyn # either ros, smote or adasyn
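The oversampling itself is done by the imbalanced-learn package, which nkululeko should pull in as a dependency; if it is missing from your environment, you can install it from PyPI (note that the distribution is named imbalanced-learn, but it is imported as imblearn):

pip install imbalanced-learn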
Three methods are available:
- ros (random oversampling): simply duplicates randomly chosen samples from the minority classes
- smote (Synthetic Minority Oversampling Technique): "invents" new minority samples by interpolating between existing ones and their nearest neighbors
- adasyn (Adaptive Synthetic sampling): similar to smote, but generates more synthetic samples where minority samples are harder to learn, so the resulting class distribution is only approximately balanced
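To see the difference between the three methods, here is a small standalone sketch that applies each of them directly with imblearn (outside of nkululeko, purely for illustration) to made-up toy data:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

# Toy imbalanced data: roughly 180 majority vs. 20 minority samples.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
print("before:", np.bincount(y))

for sampler in (RandomOverSampler(random_state=42),
                SMOTE(random_state=42),
                ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, np.bincount(y_res))

RandomOverSampler and SMOTE balance the classes exactly, while ADASYN typically lands only near balance, because it adapts the number of generated samples to the local neighborhood of each minority sample.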