Sometimes, with categorically labeled data, the number of samples per class is very unevenly distributed, misleading the model into treating the overwhelming majority class as more important than the others.
In this case, two techniques might help: class weighting assigns a higher weight to samples from minority classes, and oversampling "invents" new samples for the minority classes.
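As a quick illustration of class weighting, here is a minimal sketch using scikit-learn (independent of nkululeko; the label array is made up): "balanced" weights are inversely proportional to class frequency, so the minority class gets the larger weight.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical, heavily imbalanced label array
y = np.array(["neutral"] * 90 + ["angry"] * 10)
classes = np.unique(y)

# weights are inversely proportional to class frequency
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))  # e.g. {'angry': 5.0, 'neutral': 0.56}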
Since version 0.70.0, nkululeko can balance the training set with different algorithms implemented by the imbalanced-learn (imblearn) package.
You simply state the method in the FEATS section like so:
[FEATS]
...
balancing = adasyn # one of the methods listed below
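For context, such a line could sit in a fuller [FEATS] section like the following (a hedged sketch; the feature type 'os' for openSMILE features is only an illustrative choice):
[FEATS]
type = ['os']
balancing = smote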
The following methods are available; a short stand-alone example follows the list.
Over-sampling methods (add new minority samples):
- ros: randomly duplicate existing samples from the minority classes
- smote: "invent" new minority samples by interpolating between existing samples and their nearest neighbours
- adasyn: similar to SMOTE, but generates more synthetic samples where the minority class is harder to learn, so the resulting class distribution is not perfectly even
- borderlinesmote: SMOTE variant that focuses on minority samples near the class boundary
- svmsmote: SMOTE variant that uses an SVM to decide where to generate synthetic samples
Under-sampling methods (reduce majority classes):
- clustercentroids: replace majority class clusters with their centroids using K-means clustering
- randomundersampler: randomly remove samples from majority classes
- editednearestneighbours: remove noisy samples using edited nearest neighbors
- tomeklinks: remove Tomek links to clean class boundaries
Combination methods (over-sampling + under-sampling):
- smoteenn: combination of oversampling with SMOTE and undersampling with edited nearest neighbour (ENN)
- smotetomek: combination of SMOTE oversampling and Tomek links undersampling
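To see what these samplers actually do to the class distribution, here is a minimal stand-alone sketch using imbalanced-learn directly (not nkululeko; the toy data set and the chosen methods are only for illustration):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.combine import SMOTEENN

# toy two-class data set with a roughly 9:1 imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# resample with three of the methods mentioned above and compare class counts
for sampler in (RandomOverSampler(random_state=42),
                SMOTE(random_state=42),
                SMOTEENN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))

Note that the pure over-sampling methods end with equal class counts, while combination methods like SMOTEENN also remove some samples and therefore do not.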