
Deep Learning in Practice

Chapter 4: Small or noisy data: forms of weak supervision



Overview:

I - Small data
II - Few labeled examples: forms of weak supervision
III - Noisy data



Intro:

Classical setting: supervised training of neural networks $\implies$ requires large quantities of labeled data
What about small data? Big data with few labeled samples? Big but noisy data?

I - Small data

Data augmentation
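A minimal PyTorch/torchvision sketch of data augmentation for image classification (the dataset, the transforms and their parameters are purely illustrative):

# Minimal data-augmentation sketch with torchvision
# (dataset and transform parameters are illustrative).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

train_transform = T.Compose([
    T.RandomHorizontalFlip(),          # left-right flip with probability 0.5
    T.RandomCrop(32, padding=4),       # random translation via padded crop
    T.ColorJitter(0.2, 0.2, 0.2),      # small brightness/contrast/saturation changes
    T.ToTensor(),
])

train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)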

Multi-tasking

Hoping that features useful for another, related task on the same data will also be useful for the task of interest
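A minimal sketch of multi-task training in PyTorch: a shared encoder with one head per task, trained on a weighted sum of the losses (module names, sizes and the auxiliary weight are illustrative):

# Minimal multi-task sketch: shared encoder + two task-specific heads,
# trained on a weighted sum of both losses (all sizes illustrative).
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_main = nn.Linear(hidden, n_classes)  # task of interest
        self.head_aux = nn.Linear(hidden, 1)            # related auxiliary task

    def forward(self, x):
        h = self.shared(x)
        return self.head_main(h), self.head_aux(h)

net = MultiTaskNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.randn(32, 64)
y_main = torch.randint(0, 10, (32,))
y_aux = torch.randn(32, 1)

logits, aux_pred = net(x)
loss = nn.functional.cross_entropy(logits, y_main) \
       + 0.3 * nn.functional.mse_loss(aux_pred, y_aux)  # auxiliary weight is a hyperparameter
opt.zero_grad(); loss.backward(); opt.step()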

Transfer learning

Train first (or simultaneously) on another dataset (bigger, or with more labels available) for the same task (or a similar one)
Ex: for a medical image segmentation task with few scans available, pre-train on a larger, related dataset (e.g., natural images, or scans from another modality or hospital), then fine-tune on the available scans
Expected effect of transfer learning: faster convergence and better generalization than training from scratch, since generic low-level features do not have to be re-learned from the small dataset
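A minimal transfer-learning sketch with a recent torchvision: start from an ImageNet-pretrained ResNet-18, freeze its features, and retrain only a new head on the small target dataset (the number of target classes and the two-stage schedule are illustrative):

# Minimal transfer-learning sketch: pretrained backbone, new head
# (requires a recent torchvision providing ResNet18_Weights).
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                      # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, 5)    # new head for 5 target classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...train as usual on the small dataset; optionally unfreeze everything
# afterwards and fine-tune the whole network with a smaller learning rate.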

II - Few labeled examples: forms of weak supervision

Semi-supervision

Case where some samples (generally few) are labeled, but many other, unlabeled samples are available.
This occurs when labeling is costly (e.g., requires expertise, or time...).
Examples of techniques: pseudo-labeling (self-training on the model's own confident predictions, sketched below), consistency regularization (enforcing stable predictions under input perturbations), or unsupervised pre-training on the unlabeled data.
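A minimal sketch of pseudo-labeling, assuming a classifier `model` and an unlabeled batch `x_unlab` already exist; the confidence threshold and the loss weight are illustrative:

# Minimal pseudo-labeling (self-training) sketch: keep only unlabeled
# samples on which the current model is confident, and treat its
# predictions as labels for further training.
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(model, x_unlabeled, threshold=0.95):
    probs = F.softmax(model(x_unlabeled), dim=1)
    conf, pseudo_y = probs.max(dim=1)
    keep = conf > threshold          # only confident predictions become labels
    return x_unlabeled[keep], pseudo_y[keep]

# Usage inside a training loop (assuming model, x_unlab, x_labeled, y_labeled exist):
# x_sel, y_sel = make_pseudo_labels(model, x_unlab)
# loss = F.cross_entropy(model(x_labeled), y_labeled) \
#        + 0.5 * F.cross_entropy(model(x_sel), y_sel)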

Weak supervision

More general setting: same as above, but in addition the labels may be noisy
(cf. next section)

Self-supervision

Unsupervised way to pre-train a model, with an ad-hoc supervised (pretext) task designed so that labels are directly provided by the data itself
Ex: predicting the rotation applied to an image (see the sketch below), re-ordering shuffled image patches, predicting masked words or pixels, contrastive learning between two augmented views of the same sample...
Adding teacher-student approaches on top: a teacher network (e.g., an exponential moving average of the student) provides the targets that the student learns to match
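A minimal sketch of the rotation-prediction pretext task (RotNet-style), assuming an `encoder` and a small 4-way classifier `rot_head` on top of it:

# Minimal self-supervised pretext-task sketch: predict which of four
# rotations (0/90/180/270 degrees) was applied to an image; the labels
# come from the data itself.
import torch
import torch.nn.functional as F

def rotation_batch(images):
    """Return rotated copies of `images` and the rotation index as label."""
    rotated, labels = [], []
    for k in range(4):                                   # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Inside the pre-training loop (assuming encoder and rot_head exist):
# x_rot, y_rot = rotation_batch(x)        # x: (B, C, H, W) image batch
# loss = F.cross_entropy(rot_head(encoder(x_rot)), y_rot)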

Active learning

Same setting as semi-supervision, except that one can ask for some samples to be labeled. This is costly, so one would like to train the model to reach some target accuracy with as few labeled samples as possible. The question is then, iteratively, to pick the right examples to label, i.e., the ones that will improve the model the most.
Formally: given a labeled set $\mathcal{L}$ and an unlabeled pool $\mathcal{U}$, repeat: train the model on $\mathcal{L}$, select $x^\star \in \mathcal{U}$ according to an acquisition criterion, query its label, and move it to $\mathcal{L}$.
Examples of methods:

Local methods

(quantifying the impact of the choice based on the chosen sample only)
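A minimal sketch of a local criterion, uncertainty sampling: score each sample of the unlabeled pool by the entropy of the model's predictive distribution and request labels for the highest-entropy ones (names and the labeling budget are illustrative):

# Minimal uncertainty-sampling sketch (a "local" criterion):
# pick the samples on which the model is most uncertain.
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_scores(model, x_pool):
    probs = F.softmax(model(x_pool), dim=1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

def select_to_label(model, x_pool, budget=10):
    scores = entropy_scores(model, x_pool)
    return scores.topk(budget).indices    # indices of samples to send for labeling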


Global methods

(quantifying the impact of the choice over all dataset samples)
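A minimal sketch of a global criterion, greedy k-center (core-set style) selection: choose samples so that they cover the whole pool in feature space, rather than being individually uncertain (the feature extractor producing the embeddings is assumed to be given):

# Minimal "global" selection sketch: greedy k-center in feature space,
# so that chosen samples cover the whole unlabeled pool.
import torch

def kcenter_greedy(features, budget=10):
    """features: (N, d) tensor of pool embeddings; returns chosen indices."""
    n = features.size(0)
    chosen = [torch.randint(n, (1,)).item()]           # arbitrary first center
    dist = torch.cdist(features, features[chosen]).squeeze(1)
    for _ in range(budget - 1):
        idx = dist.argmax().item()                      # farthest point from current centers
        chosen.append(idx)
        dist = torch.minimum(dist, torch.cdist(features, features[[idx]]).squeeze(1))
    return chosen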

III - Noisy data

Denoising auto-encoder

Dealing with noisy data can sometimes be seen as noise modeling.
[Extracting and Composing Robust Features with Denoising Autoencoders; P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol; ICML 2008]
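A minimal denoising-autoencoder sketch: corrupt the input (here with Gaussian noise; the paper also considers masking noise) and train the network to reconstruct the clean input; the architecture and noise level are illustrative:

# Minimal denoising-autoencoder sketch: the target is the *clean* input,
# while the network only sees a corrupted version of it.
import torch
import torch.nn as nn

class DAE(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

dae = DAE()
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
x = torch.rand(64, 784)                          # clean batch (e.g., flattened images)
x_noisy = x + 0.3 * torch.randn_like(x)          # corrupted input
loss = nn.functional.mse_loss(dae(x_noisy), x)   # reconstruct the clean target
opt.zero_grad(); loss.backward(); opt.step()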

Classification with noisy labels

What if some samples are mislabeled in the ground truth? $\rightarrow$ little change in accuracy
What if a significant proportion of samples is mislabeled? $\rightarrow$ still possible to train and get good results!
What if most samples (90% or even 99%) are mislabeled? $\rightarrow$ still possible to get reasonable results, provided data is available in large enough quantities (e.g., roughly 10 times more data if only 10% of the ground truth is correctly labeled, 100 times more if only 1%...)
[Deep Learning is Robust to Massive Label Noise; David Rolnick, Andreas Veit, Serge Belongie, Nir Shavit; 2017]
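A minimal sketch to reproduce this observation: replace a large fraction of the training labels by uniformly random classes, then train with plain cross-entropy as usual and evaluate on a clean test set (the helper and noise rate below are illustrative):

# Minimal label-corruption sketch: flip a fraction of the labels to
# uniformly random classes, keep the training procedure unchanged.
import torch

def corrupt_labels(y, n_classes, noise_rate=0.9, seed=0):
    """Replace a fraction `noise_rate` of labels by uniformly random classes."""
    g = torch.Generator().manual_seed(seed)
    y_noisy = y.clone()
    flip = torch.rand(y.shape, generator=g) < noise_rate
    y_noisy[flip] = torch.randint(0, n_classes, (int(flip.sum()),), generator=g)
    return y_noisy

# y_train_noisy = corrupt_labels(y_train, n_classes=10, noise_rate=0.9)
# ...then train exactly as in the clean-label case and evaluate on a clean test set.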

Regression with noisy labels

What if noise is already present in the targets of the dataset, but unknown? (no clean, denoised target available)
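A minimal sketch illustrating why this can still work: with a squared loss and enough data, the regressor converges towards the conditional mean $\mathbb{E}[y \mid x]$, so zero-mean noise on the targets averages out (the toy function and noise level are illustrative):

# Minimal sketch: fit a regressor on targets corrupted by zero-mean noise;
# with an L2 loss the learned function approaches the clean signal.
import torch
import torch.nn as nn

x = torch.rand(5000, 1)
y_clean = torch.sin(6 * x)
y_noisy = y_clean + 0.5 * torch.randn_like(y_clean)   # unknown noise on the targets

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    loss = nn.functional.mse_loss(net(x), y_noisy)    # trained on noisy targets only
    opt.zero_grad(); loss.backward(); opt.step()

print(nn.functional.mse_loss(net(x), y_clean))        # small: the noise has averaged out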






