"Normalizing" or scaling feature values means shifting them to a common range, or to a distribution with the same mean and standard deviation (also called z-transformation).
You would do that for several reasons:
- Artificial neural nets handle small numbers best, so all values should ideally be in the range [-1, 1].
- Speakers have their individual ways of speaking, which you are not interested in if you want to learn a general task, e.g. emotion or age recognition. So you would normalize the values per speaker. Of course, in most applications this is not possible, because you don't already have samples of your test speakers.
- You might want to normalize across the sexes, because women typically have a higher pitch. An alternative is to use only relative values instead of absolute ones.
Mind that you shouldn't use your test set for normalization: it should only be used for testing and is supposed to be unknown. That's why you compute the normalization parameters on the training set; you can then use them to normalize/scale the test set.
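This idea can be sketched as follows: the mean and standard deviation come from the training data only and are then re-used on the test data (the toy numbers below are made up for illustration).

```python
# Sketch of z-transformation: normalization parameters are computed on
# the training set only, then applied to any other set.
import numpy as np

def fit_scaler(train):
    """Compute normalization parameters (mean, std) on the training data."""
    return train.mean(axis=0), train.std(axis=0)

def scale(data, mean, std):
    """Apply the training-set parameters to a data set."""
    return (data - mean) / std

# toy feature matrices: rows are samples, columns are features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 25.0]])

mean, std = fit_scaler(train)
train_scaled = scale(train, mean, std)
test_scaled = scale(test, mean, std)  # train parameters, not the test's own
```

After scaling, the training features have zero mean and unit standard deviation; the test values end up wherever the training parameters put them, which is exactly the point.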
Often (in fact almost always) there is a lack of training data for supervised learning.
One way to tackle this is representation learning, which can be done in a self-supervised fashion.
Another approach is to multiply your labeled training data by adding slightly altered versions of it that do not change the information you aim to detect, for example by adding noise to the data or clipping it.
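A minimal sketch of this kind of label-preserving augmentation (the array shapes, noise level and clipping limit are assumptions, not fixed recipes):

```python
# Augment training samples with noisy and clipped copies; the labels of
# the copies stay the same as those of the originals.
import numpy as np

rng = np.random.default_rng(42)

def augment_with_noise(samples, noise_level=0.01):
    """Return copies of the samples with small Gaussian noise added."""
    noise = rng.normal(0.0, noise_level, size=samples.shape)
    return samples + noise

def augment_with_clipping(samples, limit=0.8):
    """Clip the amplitude, simulating e.g. microphone overload."""
    return np.clip(samples, -limit, limit)

# four fake one-second signals at 16 kHz
signals = rng.uniform(-1, 1, size=(4, 16000))
augmented = np.concatenate([signals,
                            augment_with_noise(signals),
                            augment_with_clipping(signals)])
# the training data is now tripled; labels are repeated accordingly
```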
A third way is to synthesize data based on the labeled training data, for example with GANs, VAEs or rule-based simulation. One can distinguish whether only a parameterized form of the samples (i.e. the features) or raw samples are generated.
Sometimes only samples for a rare class are needed; in this case techniques like random over sampling (ROS), the Synthetic Minority Oversampling Technique (SMOTE) or Adaptive Synthetic sampling (ADASYN) can be used.
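The simplest of these, ROS, just duplicates minority-class samples until the classes are balanced. A plain-numpy sketch (for SMOTE and ADASYN, the imbalanced-learn package offers ready-made implementations):

```python
# Random over sampling (ROS): duplicate minority-class samples at random
# until every class has as many samples as the largest one.
import numpy as np

rng = np.random.default_rng(0)

def random_over_sample(X, y):
    """Return a class-balanced copy of (X, y) via random duplication."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

X = rng.normal(size=(10, 3))
y = np.array([0] * 8 + [1] * 2)          # class 1 is rare
X_bal, y_bal = random_over_sample(X, y)  # both classes now have 8 samples
```

SMOTE goes one step further and interpolates between minority-class neighbors instead of duplicating them verbatim.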
The parameters that configure machine learning algorithms are called meta parameters (or hyperparameters), in contrast to the "normal" parameters that are learned during training.
But as they obviously also influence the quality of your predictions, these parameters must be learned as well. Some examples:
- the C parameter for SVM
- the number of subsamples for XGB
- the number of layers and neurons for a neural net
The naive approach is simply to try them all; how to do this with Nkululeko is described here.
But in general, because the search space for the optimal configuration is usually unbounded, it is better to try a stochastic approach or a genetic one.
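The two strategies can be contrasted on a toy objective; in practice the objective would be the dev-set score of a model trained with the given meta parameters (the function `dev_score` and the parameter names below are made-up stand-ins):

```python
# Grid search (try all combinations) vs. stochastic search (sample the
# space at random), sketched on a fake dev-set score.
import itertools
import random

random.seed(1)

def dev_score(c, n_layers):
    """Stand-in for: train a model with these meta parameters,
    evaluate it on the dev set. Best at c=1.0, n_layers=3."""
    return -(c - 1.0) ** 2 - (n_layers - 3) ** 2

# naive grid search over a fixed, finite grid
grid_c = [0.01, 0.1, 1.0, 10.0]
grid_layers = [1, 2, 3, 4]
best_grid = max(itertools.product(grid_c, grid_layers),
                key=lambda p: dev_score(*p))

# stochastic search: draw random points from the (in principle unbounded) space
candidates = [(10 ** random.uniform(-2, 2), random.randint(1, 8))
              for _ in range(50)]
best_random = max(candidates, key=lambda p: dev_score(*p))
```

The grid grows exponentially with the number of meta parameters, which is why random, stochastic or genetic strategies usually scale better.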
In supervised machine learning, you usually need three kinds of data sets:
- train data: to teach the model the relation between data and labels
- dev data: (short for development) to tune meta parameters of your model, e.g. number of neurons, batch size or learning rate.
- test data: to evaluate your model ONCE at the end to check on generalization
Of course all this is to prevent overfitting on your train and/or dev data.
If you've used your test data for a while, you might need to find a new set, as chances are high that you have overfitted on it during your experiments.
So what's a good split?
Some rules apply:
- train and dev data can come from the same set, but the test set is ideally from a different database.
- if you don't have much data, a 60/20/20 % split is common
- if you have masses of data, use only as much dev and test data as needed for your population to seem covered.
- If you have really little data: use x-fold cross validation for train and dev; the test set should still be kept separate.
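The rules above can be sketched as follows: hold out the test set first and never touch it during development, then run x-fold cross validation on the remaining data (sample counts and x = 5 are arbitrary choices for illustration):

```python
# Hold out a test set, then do 5-fold cross validation on the rest
# for training and meta parameter tuning.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# hold out the test set first; it stays untouched until the very end
idx = rng.permutation(len(X))
test_idx, traindev_idx = idx[:20], idx[20:]

# x-fold cross validation on the remaining 80 samples (here x = 5):
# each fold serves once as dev data, the other folds as train data
folds = np.array_split(traindev_idx, 5)
for i, dev_fold in enumerate(folds):
    train_fold = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on X[train_fold], tune meta parameters on X[dev_fold]
```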