Kinds of machine learning

This post is an attempt to sort out some terms that are used around the topic of machine learning.

AI

meaning "artificial intelligence", is a term that the computer scientist John McCarthy coined for the Dartmouth Conference in 1956. It's really just a term to make the field sound more interesting. To this day, all so-called AI systems are simply based on statistical pattern recognition, and I don't know of a good model of human intelligence, or even a definition.

Deep learning

is a fuzzy expression connected with artificial neural nets. What is mostly meant is that the number of hidden layers (all layers apart from the input and output layers) is rather large, but it remains unclear how many layers are needed to count as deep. The more layers, the harder it is to handle the vanishing gradient problem, i.e. that the early layers no longer get updated during training because the gradients propagated back to them become too small. Another interpretation (and one that makes more sense in my opinion) is that deep learning refers to the rising level of abstraction across the hidden layers, from raw input to abstract labels (e.g. picture pixels to animal names), especially with CNNs (convolutional neural networks). For example, the early layers mainly represent edges and contours, while the later layers represent complex objects like beaks or eyes.
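
The vanishing gradient effect can be illustrated with a little arithmetic. The following is only a toy sketch, not a real network: it assumes a chain of sigmoid units with a weight of 1 and a pre-activation of 0 in every layer, so that backpropagation multiplies the gradient by the sigmoid derivative (at most 0.25) once per layer. The function names are made up for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_at_first_layer(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the per-layer factors that backpropagation applies
    on the way back to the first layer (toy assumption: identical layers)."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= weight * sigmoid_derivative(pre_activation)
    return grad

# The factor shrinks exponentially with depth: 0.25 ** num_layers
for depth in (2, 5, 10, 20):
    print(depth, gradient_at_first_layer(depth))
```

With 20 such layers the factor is already below 1e-11, which is why the earliest layers hardly change any more.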

Supervised vs. unsupervised

refers to the distinction of whether your training data is annotated (or labeled) with respect to your task. An example: If you want to build a machine learner for human age estimation based on speech, you might give an algorithm a lot of examples of human speech annotated with the age of the person. This would be your training data and the approach would be supervised (by the age annotations). With unsupervised learning, you would simply give an algorithm a lot of human speech data and ask it to cluster the data based on differences, hoping that the resulting clusters coincide with age.
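
The difference can be sketched on toy data. Everything here is an assumption made for illustration: a single hypothetical acoustic feature per speaker (something like mean pitch in Hz), two coarse age groups instead of real ages, a nearest-centroid classifier as the supervised learner and plain 2-means clustering as the unsupervised one.

```python
# Hypothetical toy data: one acoustic feature per speaker (e.g. mean pitch in Hz)
features = [110.0, 115.0, 120.0, 125.0, 240.0, 250.0, 255.0, 260.0]
labels = ["adult", "adult", "adult", "adult", "child", "child", "child", "child"]

def mean(values):
    return sum(values) / len(values)

def fit_supervised(xs, ys):
    """Nearest-centroid classifier: one mean per annotated class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}

def predict(centroids, x):
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def cluster_unsupervised(xs, iterations=10):
    """Plain 2-means clustering: uses no labels at all."""
    centers = [min(xs), max(xs)]  # simple deterministic initialisation
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

model = fit_supervised(features, labels)
print(predict(model, 130.0))           # a class name, thanks to the labels
print(cluster_unsupervised(features))  # two anonymous cluster centres
```

The supervised model can name the class of a new speaker, while the clustering only returns two unnamed groups; whether those groups correspond to age is something you would have to check afterwards.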

Semi-supervised learning

means to use so-called pseudo labels for training, i.e. annotations that were generated by a machine learning predictor rather than by a human (if the predictor's class probabilities are kept instead of a hard decision, they are soft labels). If you have vast amounts of data but only a part of it is annotated, you might try to train a machine learner supervised with the annotated data and then use the resulting model to predict labels for the rest. A variant would be to use the seed model to search for interesting data to be annotated, for example rare events in your data.
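
A minimal sketch of this procedure, under loud assumptions: the data is a hypothetical one-dimensional feature, the seed model is a simple nearest-centroid classifier, and the confidence threshold of 50 is an arbitrary value chosen for the toy data. Only confident machine-generated annotations are added to the training set; the rest is left for a human.

```python
def mean(values):
    return sum(values) / len(values)

def fit(xs, ys):
    """Nearest-centroid model: one mean per class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in sorted(set(ys))}

def predict_with_margin(centroids, x):
    """Return (label, margin); a larger margin means a more confident prediction."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - centroids[second]) - abs(x - centroids[best])

# Hypothetical data: a small annotated part and a larger unannotated part
labeled_x = [110.0, 120.0, 250.0, 260.0]
labeled_y = ["adult", "adult", "child", "child"]
unlabeled_x = [105.0, 130.0, 180.0, 245.0, 270.0]

seed_model = fit(labeled_x, labeled_y)

pseudo_x, pseudo_y, uncertain = [], [], []
for x in unlabeled_x:
    label, margin = predict_with_margin(seed_model, x)
    if margin >= 50.0:  # hypothetical confidence threshold
        pseudo_x.append(x)
        pseudo_y.append(label)
    else:
        uncertain.append(x)  # candidate for manual annotation

# Retrain on the human annotations plus the machine-generated ones
final_model = fit(labeled_x + pseudo_x, labeled_y + pseudo_y)
print(pseudo_y, uncertain, final_model)
```

In this toy run the sample at 180.0 lies between the two classes and is not given a machine-generated label.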

Self-supervised learning

means techniques to pretrain an artificial neural net by learning something from the data itself, without predicting a concrete (supervised) feature/label/annotation. This can be done, for example, by masking parts of the data and training the net to predict the masked parts, as is done when pretraining many transformer models. A different approach is to use a triplet loss, training the net to place similar samples close together and dissimilar samples far apart. Once pretrained, a self-supervised net can be used for various downstream tasks, for example classification or regression on labeled data.
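
The triplet loss mentioned above can be written down in a few lines. This sketch uses the common formulation max(0, d(anchor, positive) - d(anchor, negative) + margin) with Euclidean distance; the margin of 1.0 and the example vectors are assumptions for illustration, and in practice the vectors would be outputs of the net being trained.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative
    by at least the margin; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

print(triplet_loss(anchor, near, far))  # well separated: loss is 0.0
print(triplet_loss(anchor, far, near))  # wrong way round: loss is 5.0
```

Minimizing this loss over many triplets pushes the net towards embeddings in which related samples end up close to each other.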

Reinforcement learning

is a fundamentally different kind of machine learning, often described with the metaphor of "learning like a child". Although it mostly is, it does not necessarily have to be implemented with artificial neural nets. The main idea is that an actor (or agent) receives sensations from an environment, interprets them, and chooses new actions based on an evaluation criterion (the reward). As opposed to the loss functions used in neural learning, the evaluation criterion is rather abstract, like "reach the wall" (for a robot that should learn how to walk) or "win the game" (for a chess player). It is rather tricky to translate such evaluation criteria into concrete loss functions (needed to train a machine learner), and reinforcement learning requires vast amounts of data that are usually generated by simulations. That's the reason why, although very charming as an idea, reinforcement learning has been successful mainly in gaming applications.
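
The loop of actor, environment and evaluation criterion can be sketched without any neural net. Tabular Q-learning, a standard algorithm, is used here purely for illustration. Everything else is an assumption: a hypothetical corridor of five cells, a reward of 1 for reaching the last cell (the "wall"), purely random exploration, and arbitrarily chosen values for the learning rate and the discount factor.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the "wall" to reach
ACTIONS = (-1, +1)  # step left or step right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (arbitrary choices)

def step(state, action):
    """The simulated environment: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # one value per (state, action)
    for _ in range(episodes):
        state = 0
        for _ in range(100):      # cap on the episode length
            a = rng.randrange(2)  # explore by acting randomly
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else GAMMA * max(q[nxt]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
            if done:
                break
    return q

q_table = train()
# Greedy policy read off the table for the non-terminal cells
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

Note that even this tiny problem needs hundreds of simulated episodes, which hints at why real applications depend so heavily on simulation.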

Representation learning

is learning to distinguish the essence of the data at hand from noise factors (that might come from the recording of the data). For example, with speech it is usually not interesting which microphones recorded the speakers or what the room acoustics were. All this is present in the data, but usually not important for the task at hand and, in fact, one of the reasons for overfitting (learning the training data but not the task) and lack of generalization (being able to handle out-of-domain data, i.e. data from different sources). As modeling data in machine learning always involves a dimension reduction, representation learning searches for the dimensions that best represent the interesting aspects of the data. Self-supervised learning is a kind of representation learning.

Transfer learning

means to transfer knowledge from one domain to another. The main idea is that you can use large quantities of data that are related to your task. There are many ways to do this; for example, you may pretrain your model with data from one domain and then finetune it with the data that represents your application. Self-supervised learning is also a kind of transfer learning. Another approach is multi-task learning, where one large artificial neural net is trained for several tasks in parallel, with a so-called multi-head architecture (meaning the last layers are separate for each task).

Pretraining and finetuning

means that you pretrain an artificial neural net with some large data sets that are related to your task, or at least have the same modality. You then remove the last layers of your net, add at least an output layer for your task, and continue training with your own data. An example: wav2vec 2.0 is a model trained on many hundreds of hours of speech data with ASR (automatic speech recognition) as a main target, but its internal representations can be used as embeddings to classify emotional expression in speech.

Overfitting

means that the machine learner performs well on the training data but not on any other data. This is usually the case when the model has enough complexity to memorize all training data and is trained for enough epochs (one epoch is one pass through the training data). Measures against this are subsumed under the label regularization.
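
A constructed toy example makes the effect visible. All of it is an assumption for illustration: the data is hypothetical (the underlying rule is "label 1 if the feature is at least 5", with two deliberately noisy training labels), the complex model is a 1-nearest-neighbour learner that memorizes every training sample, and the simple model is a single threshold.

```python
# Hypothetical toy data: (feature, label). Underlying rule: label 1 if
# feature >= 5, but two training labels are noisy (at x=3 and x=7).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
test = [(1.2, 0), (2.8, 0), (4.4, 0), (5.6, 1), (7.2, 1), (8.8, 1)]

def memorizer(x):
    """Complex model: 1-nearest-neighbour, reproduces every training label."""
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]

def threshold_model(x):
    """Simple model: one threshold halfway between the two class means."""
    mean0 = sum(f for f, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    mean1 = sum(f for f, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    return 1 if x >= (mean0 + mean1) / 2 else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in (("memorizer", memorizer), ("threshold", threshold_model)):
    print(name, accuracy(model, train), accuracy(model, test))
```

On this data the memorizer gets every training sample right but only 4 of 6 test samples, because it has also learned the noisy labels. The threshold model gets only 6 of 8 training samples right but all 6 test samples.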

Loss function

is the function that is minimized when training artificial neural nets, i.e. the function that compares the predicted outcome with the desired one. Finding a good loss function is crucial for your task.
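
Two widely used loss functions, sketched in plain Python: mean squared error for regression and cross-entropy for classification. The example numbers are made up for illustration.

```python
import math

def mean_squared_error(predicted, desired):
    """Average squared difference, typically used for regression."""
    return sum((p - d) ** 2 for p, d in zip(predicted, desired)) / len(desired)

def cross_entropy(predicted_probs, true_class):
    """Negative log-probability the model assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

print(mean_squared_error([2.0, 4.0], [1.0, 6.0]))  # (1 + 4) / 2 = 2.5
print(cross_entropy([0.9, 0.1], 0))  # confident and correct: small loss
print(cross_entropy([0.6, 0.4], 0))  # less confident: larger loss
```

In both cases a smaller value means that the prediction is closer to the desired outcome.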

Embeddings

are learned representations of data, usually taken from the output of the penultimate layer of a pretrained artificial neural net.

Latent space

refers to the space of internal representations in the higher layers of a deep artificial neural net, which encode specific features of the data, for example speaker characteristics or expressed emotion in a net trained for speech synthesis. These representations are often manipulated to influence the output in a desired way, for example to simulate a specific speaking style.

Zero shot learning

means being able to predict classes that the machine learner has never seen at training time by using some auxiliary information, for example textual descriptions of the classes.
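
One way to picture this is attribute-based classification. The sketch below is entirely hypothetical: it assumes a model that has already been trained to predict three attributes from its input, and uses hand-written attribute descriptions as the auxiliary information. The class "zebra" is assumed to have no training examples at all.

```python
# Auxiliary information: each class described by hypothetical attributes
# (has_stripes, has_hooves, lives_in_water)
class_descriptions = {
    "horse": (0, 1, 0),
    "tiger": (1, 0, 0),
    "zebra": (1, 1, 0),  # assumed never seen during training
}

def classify(predicted_attributes):
    """Pick the class whose description is closest to the predicted attributes."""
    def distance(description):
        return sum((a - b) ** 2 for a, b in zip(predicted_attributes, description))
    return min(class_descriptions, key=lambda c: distance(class_descriptions[c]))

# Attribute scores as they might come from the (assumed) attribute predictor
print(classify((0.9, 0.8, 0.1)))  # -> zebra
```

The learner only ever has to predict attributes it knows; the mapping to the unseen class comes from the description.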

Adversarial learning

generally means the attempt to attack a model in order to achieve behavior that its developers and normal users do not expect. Examples would be tricking a deployed model into false classifications by disguising the input in some form, reverse-engineering the model by learning from its input and output behavior, or manipulating the training data to harm the model.

Active learning

means that the machine learner itself actively asks a so-called teacher (or oracle), often a human labeler, to annotate the samples it is unsure of.
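
A common way to decide which samples to ask about is uncertainty sampling, sketched here for a binary classifier. The probabilities are made-up model outputs and the function name is hypothetical.

```python
def most_uncertain(probabilities, k=1):
    """Indices of the k samples whose positive-class probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Hypothetical model outputs for five unlabeled samples
predicted = [0.98, 0.52, 0.10, 0.45, 0.80]
query = most_uncertain(predicted, k=2)
print(query)  # -> [1, 3]: the samples to hand to the human labeler
```

The newly annotated samples are then added to the training data and the model is retrained, after which the loop can be repeated.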

Bias vs. variance

means the trade-off between generalization (high bias, underfitting) and specialization (high variance, overfitting). You can either
a) have simple models, e.g. linear or logistic regression, that will make similar systematic errors (a strong bias) on every input, irrespective of the training set, or
b) have very complex models (e.g. a neural net with many layers) that will be more exact but very specific to your training data.
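
For squared error, the trade-off can be stated as a formula. This is the standard textbook decomposition, under the assumption that the data follows $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that $\hat{f}$ is the model learned from a randomly drawn training set (the symbols are introduced here for illustration):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Simple models as in a) tend to have a large first term, complex models as in b) a large second term; the noise term cannot be reduced by any model.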

Bio

Prof. Dr. Felix Burkhardt teaches, consults, and does research and development on speech communication, human-machine dialog systems, text-to-speech synthesis, speaker classification, ontology-based natural language modeling and emotional human-machine interfaces.

Originally an expert in speech synthesis at the Technical University of Berlin, he wrote his PhD thesis on the simulation of emotional speech by machines, recorded the Berlin acted emotions database, "EmoDB", and maintains several open source projects, including the emotional speech synthesizer "Emofilt" and the speech labeling and annotation tool "Speechalyzer". Since 2018 he has been the research director at audEERING, after having worked for Deutsche Telekom AG for 18 years. In addition, he is currently a full professor at the institute of communication science of the Technical University of Berlin.

He was a member of the European Network of Excellence HUMAINE on emotion-oriented computing and is the editor of the W3C Emotion Markup Language specification. He serves on the program committees of numerous conferences, including ACII, AVEC, EmoSPACE, FLAIR, IASTED CI, ICASSP, ICMI, ICPhS, Interspeech, IVA, IWSDS, LREC, Paraling, Prosico, SLSP and WS3P; for journals: Specom, CSL, JASA, EURASIP, SIGPRO, IEEE-TAFFC, IEEE-TMM, IEEE-TASL, IEEE-TIP, IEEE-ISSI, IJSE, ETRI, Journal of Phonetics, Neural Processing Letters and UMUAI; and for publishers: Wiley