
# Chapter 4: Small or noisy data: forms of weak supervision


$\newcommand{\epsi}{\varepsilon}$

Overview:

• I - Small data
• II - Few labeled examples: forms of weak supervision
• III - Noisy data

Intro:

Classical setting: supervised training of neural networks $\implies$ requires large quantities of labeled data
What about small data? Big data with few labeled samples? Big but noisy data?

## I - Small data

### Data augmentation

• Add various transformations to real data (possibly random, and never exactly the same transformation twice on the same sample, so that the model does not learn biases due to the transformations)
• eg: for image classification tasks: rotations, flips, contrast, color balance, noise...
• or also: smooth deformations (replacing image $A$ with $A \circ f$)
• or use a simulator
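
The random transformations listed above can be sketched as follows. This is a minimal illustration, assuming images stored as `(H, W, C)` float arrays in $[0, 1]$; the function name `augment` and all ranges (contrast factor, noise level) are arbitrary choices made here, not values from the course.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of an (H, W, C) float image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:                        # random horizontal flip
        out = out[:, ::-1, :]
    k = int(rng.integers(0, 4))                   # random rotation by k * 90 degrees
    out = np.rot90(out, k=k, axes=(0, 1))
    contrast = rng.uniform(0.8, 1.2)              # random contrast change around the mean
    out = (out - out.mean()) * contrast + out.mean()
    out = out + rng.normal(0.0, 0.02, size=out.shape)   # small additive noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                   # stand-in for a real training image
batch = np.stack([augment(image, rng) for _ in range(8)])
```

Drawing fresh random parameters at every call is what keeps the same sample from being seen twice under exactly the same transformation.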

### Multi-task learning

Train the same network on an auxiliary task as well, hoping that features useful for another, related task on the same data will also be useful for the task of interest
• If the tasks are too unrelated: training both with the same network will decrease result quality (as the tasks will compete for the features relevant to each of them)
• If the tasks are related enough: accuracy increases on the real target task (as some features useful for the auxiliary task are also useful for it)
• Example: for a task of registering satellite images to cadaster maps, also learning to segment the images $\implies$ both tasks reach better accuracy when trained together

### Transfer learning

Train first (or simultaneously) on another (bigger, or with more available labels) dataset for the same task (or a similar one)
Ex: for a medical image segmentation task, with few scans available:
• pick VGG already trained on ImageNet,
• replace the last layers with your own,
• $\Leftrightarrow$ using the penultimate layer of VGG as features
• $\implies$ surprisingly, such seemingly different tasks are sometimes close enough for this to work (in both cases, the network develops features for image analysis)

[Figure: expected effect of transfer learning]

## II - Few labeled examples: forms of weak supervision

### Semi-supervision

Case where some samples (generally: few) are labeled, but many other, unlabeled samples are available.
This occurs when labeling is costly (e.g., requires expertise, or time...).
Examples of techniques:
• run unsupervised learning on all the data (to get clusters, features...), then train in a supervised way on the labeled part;
• train in a supervised way, use the model to label some of the unlabeled samples, and iterate (issue: the model may make mistakes, which get amplified over iterations);
• train in a supervised way, but check properties of the classifier on the whole dataset (bias, density, margin...)
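
The second technique (train, pseudo-label, iterate) can be sketched on toy data. Everything specific here is an assumption made for illustration: the two Gaussian blobs, the nearest-centroid classifier, the confidence threshold of 0.99 and the 5 rounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian blobs in 2D.
n_per_class = 300
X = np.concatenate([rng.normal(-2.5, 1.0, size=(n_per_class, 2)),
                    rng.normal(+2.5, 1.0, size=(n_per_class, 2))])
y_true = np.repeat([0, 1], n_per_class)

# Only 5 labels per class are known; -1 marks an unlabeled sample.
labels = np.full(len(X), -1)
known = np.concatenate([np.arange(5), n_per_class + np.arange(5)])
labels[known] = y_true[known]

def fit_centroids(X, labels):
    return np.stack([X[labels == c].mean(axis=0) for c in (0, 1)])

def predict_proba(X, centroids):
    # Softmax over negative squared distances to the class centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

for _ in range(5):
    centroids = fit_centroids(X, labels)
    proba = predict_proba(X, centroids)
    # Pseudo-label only the unlabeled samples on which the model is confident.
    confident = (labels == -1) & (proba.max(axis=1) > 0.99)
    labels[confident] = proba.argmax(axis=1)[confident]

centroids = fit_centroids(X, labels)
accuracy = (predict_proba(X, centroids).argmax(axis=1) == y_true).mean()
```

Note that a pseudo-label, once assigned, is never revised in this sketch: this is exactly how early mistakes can get reinforced.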

### Weak supervision

More general setting: same as above, but in addition the labels may be noisy
(cf. next section)

### Self-supervision

An unsupervised way to pre-train a model: design an ad-hoc supervised task whose labels are provided directly by the data itself
Ex:

### Active learning

Same setting as semi-supervision, except that one can ask for some samples to be labeled. This is costly, so one would like to train the model to reach some target accuracy with as few labeled samples as possible. The question is then, iteratively, which examples to label in order to improve the model the most.
Formally:
• a large dataset $(x_1, x_2, \dots, x_n)$ is given
• with labels known only for a small number of samples: $(y_1, y_2, \dots, y_p)$ with $p \ll n$
• which $x_i$ (i.e. which $i \in \{p+1, \dots, n\}$) should be picked and sent for labeling?
• the choice can be based on the predictions of the current model for all unlabeled samples: $\hat{y}_i$, which, in a classification task, are probability vectors for each sample $i$: $\hat{y}_i = (\hat{y}^c_i)_{c \in C}$ where $C$ is the set of classes.

Examples of methods:$\DeclareMathOperator*{\argmax}{arg\,max}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

#### Local methods

(quantifying the impact of the choice over the sample chosen only)

• Uncertainty sampling:
pick the sample $x_i$ for which the model is the most uncertain, i.e. with lowest prediction confidence: $$\argmin_i\;\; \sup_{c \in C}\;\; \hat{y}^c_i$$
• Margin sampling:
quantify uncertainty by the difference between the two highest class probabilities: $$\argmin_i\;\; \left( \hat{y}^{c1}_i - \hat{y}^{c2}_i \right)$$ where $c1$ and $c2$ are the two most probable classes for sample $i$, i.e. $c1 = \argmax_c \hat{y}^{c}_i\;$ and $\;c2 = \argmax_{c \neq c1} \hat{y}^{c}_i$
• Entropy sampling:
entropy is maximal when the probability distribution $\hat{y}^c_i$ over classes $c$ is uniform (the model is most uncertain about which class to pick), and minimal when it is a Dirac peak on one class (the model is most certain)
$$\argmax_i\;\; H(\hat{y}^{c}_i) \;\;\;\text{ where } H \text{ is the entropy: } \;\; H(\hat{y}^{c}_i) = - \sum_c \hat{y}^{c}_i \log \hat{y}^{c}_i$$
• Query by committee
Suppose your model is actually an ensemble of $K$ models $m_k$ (for $k \in [1, K]$), trained on the same data, but possibly making different predictions $y_{i,k}$.
A possible way to quantify uncertainty for one sample $x_i$ is to check how much the predictions $y_{i,k}$ differ for the different models.
Then, select the point $x_i$ for which the models disagree the most.
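
The four local criteria can be written in a few lines each. This is a sketch, assuming the predictions are given as an array `proba` of shape $(n, |C|)$ (one probability vector per unlabeled sample). The function names are hypothetical, and measuring committee disagreement by the entropy of the votes is only one possible choice.

```python
import numpy as np

def least_confidence(proba):
    """Uncertainty sampling: sample whose top-class probability is lowest."""
    return int(np.argmin(proba.max(axis=1)))

def smallest_margin(proba):
    """Margin sampling: sample with the smallest gap between its two top classes."""
    top2 = np.sort(proba, axis=1)[:, -2:]          # two largest probabilities per row
    return int(np.argmin(top2[:, 1] - top2[:, 0]))

def max_entropy(proba, eps=1e-12):
    """Entropy sampling: sample whose predicted distribution has maximal entropy."""
    entropy = -(proba * np.log(proba + eps)).sum(axis=1)
    return int(np.argmax(entropy))

def committee_disagreement(committee_proba):
    """Query by committee, for predictions of shape (K, n, |C|).
    Disagreement is measured here by the entropy of the K hard votes."""
    K, n, n_classes = committee_proba.shape
    votes = committee_proba.argmax(axis=2)          # (K, n): class chosen by each model
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return max_entropy(counts / K)

proba = np.array([[0.90, 0.05, 0.05],
                  [0.50, 0.49, 0.01],
                  [0.40, 0.30, 0.30]])
picked = (least_confidence(proba), smallest_margin(proba), max_entropy(proba))

committee = np.array([[[0.8, 0.2], [0.7, 0.3]],
                      [[0.9, 0.1], [0.4, 0.6]],
                      [[0.7, 0.3], [0.2, 0.8]]])
picked_by_committee = committee_disagreement(committee)
```

On this small example the criteria do not all agree: margin sampling picks the second sample (two nearly tied classes), while least confidence and entropy pick the third one.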

#### Global methods

(quantifying the impact of the choice over all dataset samples)

• Expected model change
Which new labeled sample would change the current model the most, if we performed one additional gradient descent step on the parameters $\theta$ of the model to minimize its prediction error?
$\implies$ which sample would induce the largest variation of the model parameters?
$\rightarrow$ approximated by the norm of the gradient of the loss for that sample,
in expectation over the predicted probabilities of what its label could be:
$$\argmax_i\;\; \sum_c \; \hat{y}^{c}_i \; \|\nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)\|$$ noting that $\hat{y}^c_i$ depends on $\theta$, and denoting by $\delta_c$ the Dirac peak on class $c$.
• Expected error or uncertainty reduction
How much is the prediction error (or uncertainty) over all samples reduced if we re-train with that additional label?
$$\argmin_i\;\; \sum_{c \in C} \; \hat{y}^c_i \; \sum_j \text{error prediction for } x_j \text{ if trained with } (x_i, y^c_i) \text{ also}$$
Any of the previous ways to quantify uncertainty or error can be used here.
Either retrain the model, or use a first-order approximation: take the gradient of the prediction error, e.g. $\nabla_\theta \, \text{Loss}(\hat{y}^{c'}_j, \delta_{c'})$, averaged over the possible classes $c'$ for $x_j$ and over the samples $x_j$, and multiply it by the parameter shift induced by the new sample, $-\nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)$. The error decreases the most when this product is the most negative, hence: $$\argmax_i\;\; \sum_{c \in C} \; \hat{y}^c_i \; \sum_j \; \sum_{c' \in C} \hat{y}^{c'}_j \; \nabla_\theta \, \text{Loss}(\hat{y}^{c'}_j, \delta_{c'}) \;\cdot\; \nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)$$ $\implies$ lots of computations, but still linear in the dataset size
NB: this is precisely the notion of similarity we will study in the noisy data section

• Density-Weighted Methods
Search for samples that are representative of many other ones (whereas, on the contrary, the most uncertain sample might be irrelevant for the other samples)
$$\argmax_i\;\; [\text{information brought by } (x_i, y_i) ] \times \sum_j \text{similarity}(x_i,x_j)$$ where "information brought" denotes any of the previous methods (to quantify uncertainty or error at $x_i$) and "similarity" is defined appropriately
NB: with the right choices, this can actually boil down to the same formula as the previous method.
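
As an illustration of the expected model change criterion, here is a sketch for one specific model chosen for convenience (an assumption, not the course's setting): softmax regression $\hat{y} = \text{softmax}(W x)$ with cross-entropy loss. In that case the gradient with respect to $W$ for a candidate label $c$ is $(\hat{y} - \delta_c)\, x^T$, so its norm is $\|\hat{y} - \delta_c\| \, \|x\|$ and no automatic differentiation is needed. The density weighting uses a Gaussian similarity, which is also just one possible choice.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_gradient_length(W, X):
    """Score of each row x of X: sum_c p_c * ||grad_W CE(softmax(W x), delta_c)||."""
    P = softmax(X @ W.T)                                     # (n, |C|) predictions
    eye = np.eye(W.shape[0])                                 # rows are the Dirac peaks delta_c
    # resid[i, c] = ||p_i - delta_c|| for every sample i and candidate label c
    resid = np.linalg.norm(P[:, None, :] - eye[None, :, :], axis=2)
    return (P * resid).sum(axis=1) * np.linalg.norm(X, axis=1)

def density_weighted(scores, X, bandwidth=1.0):
    """Multiply each score by the summed Gaussian similarity to all samples."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return scores * np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)

W = np.array([[1.0, 0.0],
              [-1.0, 0.0]])
X = np.array([[3.0, 0.0],     # confidently classified sample
              [0.0, 3.0]])    # sample on the decision boundary (p = [0.5, 0.5])
scores = expected_gradient_length(W, X)
weighted = density_weighted(scores, X)
```

The sample on the decision boundary gets the larger score: whatever its label turns out to be, it would move the parameters much more than the confidently classified one.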

## III - Noisy data

### Denoising auto-encoder

Dealing with noisy data can sometimes be seen as noise modeling.
[Extracting and composing robust features with denoising autoencoders; P. Vincent, H. Larochelle, Y. Bengio and P.A. Manzagol; ICML 2008]
• originally meant to make auto-encoders more robust (without requiring a narrow middle layer with few neurons)
• feed noisy inputs (in the article, artificially added noise) and ask for a reconstruction without noise
• learns to get robust features and to denoise
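
The training objective (noisy input, clean target) can be illustrated without a neural network. In the sketch below, a single linear map fitted by least squares stands in for the encoder-decoder, and the data are synthetic signals lying in a low-dimensional subspace. Both are assumptions made to keep the example short; this is not the architecture of the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 2000, 20, 3, 0.5

def sample_clean(n_samples, basis):
    # Clean signals live in a k-dimensional subspace of R^d (toy assumption).
    return rng.normal(size=(n_samples, basis.shape[0])) @ basis

basis = rng.normal(size=(k, d))
clean_train = sample_clean(n, basis)
clean_test = sample_clean(500, basis)
noisy_train = clean_train + sigma * rng.normal(size=clean_train.shape)
noisy_test = clean_test + sigma * rng.normal(size=clean_test.shape)

# Denoising objective: map the noisy input to the clean target.
# W (d x d) plays the role of the whole auto-encoder.
W, *_ = np.linalg.lstsq(noisy_train, clean_train, rcond=None)

noisy_mse = np.mean((noisy_test - clean_test) ** 2)          # error before denoising
denoised_mse = np.mean((noisy_test @ W - clean_test) ** 2)   # error after denoising
```

Although `W` is a full $d \times d$ matrix with no bottleneck, the denoising objective pushes it to keep the directions that carry signal and to shrink the others.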

### Classification with noisy labels

What if some samples are mislabeled in the ground truth? $\rightarrow$ accuracy barely changes
What if a significant proportion of samples are mislabeled? $\rightarrow$ still possible to train and get good results!
What if most samples (90% or 99%) are mislabeled? $\rightarrow$ still possible to get reasonable results, provided data is available in large quantities (e.g., 10 times more data if 10% of the ground truth is well labeled, 100 times more if 1%...)
[Deep Learning is Robust to Massive Label Noise; David Rolnick, Andreas Veit, Serge Belongie, Nir Shavit; 2017]
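
A toy experiment gives some intuition for why label noise can average out. The sketch below is not the deep-learning setting of the paper: it uses a nearest-class-mean classifier on two Gaussian blobs, with 40% of the training labels flipped at random (all of these choices are assumptions made for illustration). In this binary setting the flip rate has to stay below 50% for the correct labels to remain the majority.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0.0, 1.0, size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return X, y

def nearest_mean_accuracy(X_train, y_train, X_test, y_test):
    # Fit one mean per class, then classify test points by the closest mean.
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d2 = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return (d2.argmin(axis=1) == y_test).mean()

X_train, y_clean = make_data(20000)
X_test, y_test = make_data(5000)

flip = rng.random(len(y_clean)) < 0.4            # 40% of the training labels are flipped
y_noisy = np.where(flip, 1 - y_clean, y_clean)

acc_clean = nearest_mean_accuracy(X_train, y_clean, X_test, y_test)
acc_noisy = nearest_mean_accuracy(X_train, y_noisy, X_test, y_test)
```

With noisy labels the two class means move closer to each other, but with enough samples they stay on the correct sides, so the decision boundary (and the accuracy on clean test labels) is almost unchanged.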

### Regression with noisy labels

What if noise is already present in the dataset, but unknown? (no denoised target available)
• is data self-realignment possible?
• example: registering RGB satellite images to cadaster maps (binary pictures indicating building presence): no exact ground truth is available (spurious deformations due to data acquisition and terrain elevation, human mistakes in drawing cadasters)
• in the case of a regression task $x \mapsto y$ with noisy dataset samples $(x,y)$:
• if the output noise (on $y$) is approximately i.i.d. (does not depend on $x$),
or more exactly: if "relatively similar" inputs $x, x'$, which should have the same true label, have independent noises on their labels $y, y'$,
then the noise averages out, provided it is unbiased (zero mean)
• Extreme case of only one possible input $x$, seen many times with different noisy labels $y_i$:
if the loss is the $L_2$ metric, $\sum_i \| \hat{y}(x) - y_i \|^2$, with $y_i = y_{\text{real}} + \text{noise}_i$ (the noise being unknown),
then the optimal $\hat{y}(x)$ is the mean of the $y_i$, i.e. $y_{\text{real}} \pm O\left(\frac{1}{\sqrt{\text{number of samples}}}\right)$
[Noise2Noise: Learning Image Restoration without Clean Data; Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, Timo Aila; 2018]
• General case of different inputs: requires a notion of similarity between any $x$ and $x'$
$\rightarrow$ defined and studied in [Input similarity from the neural network perspective; G. Charpiat, N. Girard, L. Felardos and Y. Tarabalka; NeurIPS 2019]
$\implies$ similarity between $x$ and $x'$: based on $\nabla_\theta\, \hat{y}(x) \,\cdot\, \nabla_\theta\, \hat{y}(x')$ (how much a variation of parameters meant to change the labels of point $x$ would affect the labels of point $x'$ too) $\implies$ gives the expected magnitude of the denoising effect for a given trained neural network, without knowing true labels
$\implies$ linked to the Neural Tangent Kernel (NTK)
• in practice: denoise the dataset this way (here, re-align) and then re-learn from the aligned dataset (as less noise $\implies$ better results [theoretically same global optimum, but with different confidence (= variance)])
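
Two points of the list above can be checked numerically. The sketch below is only illustrative: the Gaussian noise, the tiny one-hidden-layer network $\hat{y}(x) = w_2 \cdot \tanh(W_1 x + b_1)$, its random parameters and the normalization of the gradient product into a cosine are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Single input x seen many times with noisy labels: the L2-optimal
#    prediction is the mean of the labels, and its error shrinks with n.
y_real = 3.0
errors = {}
for n in (100, 10000):
    y_noisy = y_real + rng.normal(0.0, 1.0, size=n)    # zero-mean i.i.d. noise
    errors[n] = abs(y_noisy.mean() - y_real)

# 2) Different inputs: similarity based on the gradients of the output
#    with respect to the parameters theta = (W1, b1, w2).
def grad_theta(x, W1, b1, w2):
    """Gradient of y_hat(x) = w2 . tanh(W1 x + b1) w.r.t. theta, flattened."""
    h = np.tanh(W1 @ x + b1)
    dh = w2 * (1.0 - h ** 2)                  # d y_hat / d pre-activation
    # Order: d/dW1 (outer product), d/db1, d/dw2
    return np.concatenate([np.outer(dh, x).ravel(), dh, h])

def similarity(x, x_prime, params):
    """Cosine of the angle between the two parameter gradients."""
    g, g_prime = grad_theta(x, *params), grad_theta(x_prime, *params)
    return g @ g_prime / (np.linalg.norm(g) * np.linalg.norm(g_prime))

params = (rng.normal(size=(8, 2)), rng.normal(size=8), rng.normal(size=8))
x = np.array([1.0, 0.5])
x_far = np.array([-2.0, 3.0])
sim_self = similarity(x, x, params)
sim_far = similarity(x, x_far, params)
```

A similarity close to 1 means that a gradient step meant to change the output at $x$ changes the output at $x'$ almost as much, so the labels of the two points get averaged together during training.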
