"Normalizing" or scaling feature values means to shift them to a common range, or distribution with same mean and standard deviation (also called z-transformation).
You would do that for several reasons:
- Artificial neural networks work best with small input values, so all features should roughly be in the range [-1, 1].
- Speakers have their individual ways of speaking, which you are not interested in if you want to learn a general task, e.g. emotion or age recognition. In that case you would normalize the values for each speaker individually (see the sketch after this list). Of course, in most applications this is not possible, because you don't already have samples of your test speakers.
- You might want to normalize across the sexes, because women typically have a higher pitch. Another way out is to use only relative values instead of absolute ones.
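Speaker normalization, as mentioned above, could be sketched like this, assuming a pandas DataFrame with one row per sample, a speaker column, and numeric feature columns (all names here are hypothetical):

```python
import pandas as pd

def speaker_normalize(df, feature_cols, speaker_col="speaker"):
    """Z-transform each feature within each speaker separately,
    removing speaker-specific offsets such as habitual pitch."""
    out = df.copy()
    out[feature_cols] = out.groupby(speaker_col)[feature_cols].transform(
        lambda s: (s - s.mean()) / s.std()
    )
    return out

# toy example with two speakers and one feature
df = pd.DataFrame({
    "speaker": ["a", "a", "b", "b"],
    "pitch": [200.0, 220.0, 110.0, 130.0],
})
print(speaker_normalize(df, ["pitch"]))
```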
Mind that you shouldn't use your test set for normalization, as it should really only be used for testing and is supposed to be unknown. That's why you compute the normalization parameters on the training set; you can then use them to normalize/scale the test set.
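With scikit-learn this could look roughly like the following sketch, where the dummy matrices stand in for your actual training and test features (samples x features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# dummy feature matrices just for illustration
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.5, 250.0]])

scaler = StandardScaler()
# learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# apply the same training parameters to the (unseen) test data
X_test_scaled = scaler.transform(X_test)
```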
Nkululeko exercise
- Try different scaling methods for the acoustic features with nkululeko.