"Normalizing" or scaling feature values means to shift them to a common range, or distribution with same mean and standard deviation (also called z-transformation).
You would do that for several reasons:
- Artificial neural nets handle small numbers best, so all values should lie in the range [-1, 1].
- Speakers have their individual ways of speaking, which you are not interested in if you want to learn a general task, e.g. emotion or age. So you would speaker-normalize the values for each speaker individually. Of course, in most applications this is not possible because you don't have samples of your test speakers beforehand.
- You might want to normalize the sexes, because women typically have a higher pitch. Another way out is to use only relative values and not absolute ones.
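Speaker normalization as described in the list above could look roughly like this. It is only a sketch: the function name `speaker_normalize`, the speaker ids and the pitch values are made up for illustration, and it assumes you know the speaker of every sample.

```python
import statistics
from collections import defaultdict


def speaker_normalize(values, speakers):
    """Z-transform each value with the mean and std of its own speaker."""
    by_speaker = defaultdict(list)
    for value, speaker in zip(values, speakers):
        by_speaker[speaker].append(value)

    stats = {}
    for speaker, vals in by_speaker.items():
        mean = statistics.fmean(vals)
        # Fall back to 1.0 if a speaker has no variation, to avoid dividing by zero
        std = statistics.pstdev(vals) or 1.0
        stats[speaker] = (mean, std)

    return [(v - stats[s][0]) / stats[s][1] for v, s in zip(values, speakers)]


# Hypothetical mean pitch values (Hz) for a low-pitched and a high-pitched speaker
pitch = [100.0, 120.0, 140.0, 200.0, 220.0, 240.0]
speakers = ["spk_a", "spk_a", "spk_a", "spk_b", "spk_b", "spk_b"]
normalized = speaker_normalize(pitch, speakers)
```

After this step both speakers are centered around zero, so the absolute pitch difference between them is gone and only the variation within each speaker remains.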
Mind that you shouldn't use your test set for normalization, as it should only be used for testing and is supposed to be unknown. That's why you should compute your normalization parameters on the training set; you can then use them to normalize/scale the test set.
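To make the train/test point concrete, here is a minimal sketch of a z-transformation whose parameters come from the training set only. The helper names `fit_z_params` and `z_transform` and the feature values are assumptions for illustration.

```python
import statistics


def fit_z_params(train_values):
    """Estimate mean and standard deviation on the training set only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def z_transform(values, mean, std):
    """Apply already estimated parameters to any set of values."""
    if std == 0:
        raise ValueError("standard deviation is zero, cannot scale")
    return [(v - mean) / std for v in values]


# Hypothetical feature values, e.g. mean pitch in Hz
train_pitch = [110.0, 120.0, 130.0, 200.0, 210.0, 220.0]
test_pitch = [115.0, 205.0]

mean, std = fit_z_params(train_pitch)            # parameters from training data
train_norm = z_transform(train_pitch, mean, std)
test_norm = z_transform(test_pitch, mean, std)   # reused, not re-estimated
```

The normalized training values have mean 0 and standard deviation 1 by construction; the test values generally do not, and that is fine.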
Often (kind of always) there is a lack of training data for supervised learning.
One way to tackle this is representation learning, which can be done in a self-supervised fashion.
Another approach is to multiply your labeled training data by adding slightly altered versions of it. The alterations must not change the information that is the aim of the detection; examples are adding noise to the data or clipping it.
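A minimal sketch of this kind of augmentation, assuming the samples are plain lists of amplitude values and that "clipping" means limiting the amplitude. The function names, the noise level and the clip limit are made up for illustration.

```python
import random


def add_noise(signal, noise_std, rng):
    """Add Gaussian noise with the given standard deviation to each value."""
    return [s + rng.gauss(0.0, noise_std) for s in signal]


def clip(signal, limit):
    """Limit all amplitudes to the range [-limit, limit]."""
    return [max(-limit, min(limit, s)) for s in signal]


def augment(dataset, n_copies=2, noise_std=0.01, clip_limit=0.8, seed=0):
    """Keep the original (signal, label) pairs and append altered copies."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for signal, label in dataset:
        for _ in range(n_copies):
            altered = clip(add_noise(signal, noise_std, rng), clip_limit)
            augmented.append((altered, label))  # the label stays the same
    return augmented


# Hypothetical toy data: very short "signals" with emotion labels
dataset = [([0.0, 0.5, -0.5, 0.9], "happy"), ([0.1, -0.2, 0.3, -0.9], "sad")]
augmented = augment(dataset)
```

Whether a given alteration really leaves the target information intact depends on the task, so the noise level is something you would want to check by listening or by experiment.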
A third way is to synthesize data based on the labeled training data, for example with GANs, VAEs or rule-based simulation. Here one can distinguish whether only a parameterized form of the samples (i.e. the features) or raw samples are generated.
Sometimes only samples for a rare class are needed; in this case, techniques like random oversampling (ROS), the Synthetic Minority Oversampling Technique (SMOTE) or Adaptive Synthetic sampling (ADASYN) can be used.
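Random oversampling is the simplest of these techniques: samples of the rare classes are duplicated at random until all classes are equally frequent. The following is a plain-Python sketch, not the implementation of any particular library; the function name and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # exact copy of an existing sample
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: four neutral samples, one angry sample
samples = ["n1", "n2", "n3", "n4", "a1"]
labels = ["neutral", "neutral", "neutral", "neutral", "angry"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

SMOTE and ADASYN differ from this in that they interpolate new feature vectors between existing minority samples instead of copying them.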
The parameters that configure machine learning algorithms are called meta parameters (or, more commonly, hyperparameters), in contrast to the "normal" parameters that are learned during training.
But as they obviously also influence the quality of your predictions, these parameters must be optimized as well. Examples are:
- the C parameter for SVM
- the subsample ratio (the fraction of training samples drawn in each boosting round) for XGB
- the number of layers and neurons for a neural net
The naive approach is simply to try them all; how to do this with Nkululeko is described here.
But in general, because the search space for the optimal configuration is usually unbounded, it is better to try a stochastic approach or a genetic one.
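A stochastic approach can be as simple as random search: draw a limited number of configurations at random and keep the best one. The sketch below is generic Python and not Nkululeko code; `toy_evaluate` is a hypothetical stand-in for "train a model with this configuration and return its score on a development set".

```python
import random


def random_search(evaluate, search_space, n_trials=20, seed=0):
    """Sample random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per meta parameter
        config = {name: rng.choice(options) for name, options in search_space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Hypothetical score; a real version would train and evaluate a model."""
    return -abs(config["C"] - 1.0) - 0.1 * abs(config["n_layers"] - 2)


search_space = {
    "C": [0.01, 0.1, 1.0, 10.0],   # e.g. the C parameter of an SVM
    "n_layers": [1, 2, 3, 4],      # e.g. the number of layers of a neural net
}
best_config, best_score = random_search(toy_evaluate, search_space)
```

The score should come from a development set and not from the test set, for the same reason that normalization parameters should not be computed on the test set.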
This is the first of a series of posts to support my lecture "speech processing with machine learning".
The focus is an introduction to related topics, mainly machine learning, as I teach phoneticians who already know a lot about speech.
This page is the landing page, which serves as a table of contents for the posts. I will try to introduce a meaningful order for the posts, but reading them sequentially is not required. As said, it's introductory anyway and it's very easy to find much deeper posts on the net. E.g. here's a great list with pictures