$\newcommand{\E}{\mathbb{E}}$ $\newcommand{\R}{\mathbb{R}}$

Deep Learning in Practice

Chapter 4: Small or noisy data: forms of weak supervision

NB: this page is still being updated

$\newcommand{\epsi}{\varepsilon}$

Overview:

I - Small data

Data augmentation
Multi-tasking
Transfer learning

II - Few labeled examples: forms of weak supervision

Semi-supervision
Weak supervision
Self-supervision
Active learning
- Local methods
- Global methods

III - Noisy data

Denoising auto-encoder
Classification with noisy labels
Regression with noisy labels

Intro:

Classical setting: supervised training of neural networks $\implies$ requires large quantities of labeled data
What about small data? big data with few labeled samples? big but noisy data?

I - Small data

Data augmentation

Apply various transformations to real data (possibly random, and not using exactly the same transformation twice on the same sample, so as not to learn biases due to the transformations)
- eg: for image classification tasks: rotations, flips, contrast, color balance, noise...
- or also: smooth deformations (replacing image $A$ with $A \circ f$)
- more surprising: CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features ; Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo ; ICCV 2019:
  - take 2 samples, copy-paste a rectangular part of one over the other one, and train with a linear combination of the 2 losses (with the 2 outputs), weighted by area ratio, or ask to predict a linear combination of the 2 outputs (weighted by image area as well)
  - $\implies$ learn to ground decision confidence on evidence proportions (size of image part that contains the object)
Or use a simulator to generate additional training data
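The CutMix recipe described above can be sketched as follows. This is a minimal illustration, not the implementation from the paper: images are plain nested lists, and the way the rectangle is sampled (`min_frac`, `max_frac`) as well as the function names are assumptions made for this example.

```python
import random

def cutmix(img_a, img_b, rng, min_frac=0.2, max_frac=0.8):
    """Paste a random rectangle of img_b over a copy of img_a.

    Images are H x W nested lists of pixel values, with the same size.
    Returns the mixed image and lam, the fraction of pixels still coming from img_a.
    """
    h, w = len(img_a), len(img_a[0])
    # Rectangle size: a random fraction of each side (hypothetical sampling scheme).
    rh = max(1, int(h * rng.uniform(min_frac, max_frac)))
    rw = max(1, int(w * rng.uniform(min_frac, max_frac)))
    top = rng.randint(0, h - rh)
    left = rng.randint(0, w - rw)
    mixed = [row[:] for row in img_a]
    for r in range(top, top + rh):
        mixed[r][left:left + rw] = img_b[r][left:left + rw]
    lam = 1.0 - (rh * rw) / (h * w)
    return mixed, lam

def cutmix_loss(loss_a, loss_b, lam):
    """Linear combination of the two losses, weighted by area ratio."""
    return lam * loss_a + (1.0 - lam) * loss_b

rng = random.Random(0)
img_a = [[0] * 8 for _ in range(8)]  # toy "image" of class A
img_b = [[1] * 8 for _ in range(8)]  # toy "image" of class B
mixed, lam = cutmix(img_a, img_b, rng)
```

Here `loss_a` and `loss_b` stand for the loss of the network output on `mixed` with respect to the label of each source image; `lam` is the area ratio used to weight them.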

Multi-tasking

Hoping that features useful for another, related task on the same data will also be useful for the task of interest

If the tasks are too unrelated: training both with the same network will decrease result quality (as the tasks will compete with each other for the features relevant to them)
If the tasks are related enough: increase in accuracy for the real targeted task (as some features useful for the auxiliary task are also useful for it)
Example: for a task of registering satellite images to cadaster maps, also learning to segment the images $\implies$ both tasks get better accuracies when trained together

Transfer learning

Train first (or simultaneously) on another (bigger, or with more available labels) dataset for the same task (or a similar one)
Ex: for a medical image segmentation task, with few scans available:

pick VGG already trained on ImageNet,
replace the last layers by your own,
train them for your own task on your dataset
$\Leftrightarrow$ using the penultimate layer of VGG as a feature extractor
$\implies$ surprisingly, sometimes, such seemingly-different tasks are in fact close enough for this to work (in both cases, develop features for image analysis)

Expected effect of transfer learning:

if very small data: big help in getting relevant features
if not that small data: large reduction in training time, but not necessarily any accuracy gain if trained until convergence:
[Rethinking ImageNet pre-training; Kaiming He, Ross Girshick, Piotr Dollár; ICCV 2019]

Other ways to fine-tune:

LoRA (Low-Rank Adaptation): do not tune whole matrices (i.e. not all coefficients) but add just low-rank matrices to them
case of LLMs: learn a new token, inserted before all sentences for this task
prompting
RAG
Note that training an LLM is very costly, but fine-tuning it is cheap, both in number of samples and in time/energy.

NB: fine-tuning just a few parameters is faster + prevents overfitting
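
As a rough illustration of why LoRA tunes so few parameters (a sketch in the usual LoRA notation; the symbols $W$, $A$, $B$, $d$, $k$, $r$ and the sizes below are assumptions, not taken from these notes): for a frozen pre-trained weight matrix $W \in \R^{d \times k}$, only a low-rank correction is trained,
$$W' = W + B A \quad \text{with } B \in \R^{d \times r},\;\; A \in \R^{r \times k},\;\; r \ll \min(d, k)$$
so the number of trainable coefficients drops from $dk$ to $r(d+k)$. For instance, with hypothetical sizes $d = k = 4096$ and $r = 8$:
$$r(d+k) = 8 \times 8192 = 65\,536 \quad \text{vs.} \quad dk = 16\,777\,216$$
i.e. about 0.4% of the coefficients of that matrix.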

Issue of transfer learning from large generic models: need to compress the obtained model to reduce energy and time costs

II - Few labeled examples: forms of weak supervision

Semi-supervision

Case where some samples (generally: few) are labeled, but many other, unlabeled samples are available.
This occurs when labeling is costly (e.g., requires expertise, or time...).
Examples of techniques:

unsupervised learning on all data, labeled or not (to get clusters, features...), then supervised training;
supervised training, then labeling some of the unlabeled samples with the model's predictions; iterate (issue: the model might make mistakes, which get amplified over iterations);
supervised training, while checking classifier properties on the whole dataset (bias, density, margin...)
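
The second technique (train, label some unlabeled samples, iterate) is often called self-training or pseudo-labeling. A minimal sketch, assuming a toy 1D nearest-centroid classifier as a stand-in for the network, and a hypothetical confidence score and `threshold`:

```python
def fit_centroids(xs, ys):
    """Toy 'model': one mean per class (stand-in for a real classifier)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict(centroids, x):
    """Return (label, confidence); the confidence score is a hypothetical choice.

    Assumes at least two classes.
    """
    dists = sorted((abs(x - m), c) for c, m in centroids.items())
    (d1, c1), (d2, _) = dists[0], dists[1]
    conf = 1.0 - d1 / (d1 + d2) if d1 + d2 > 0 else 0.5
    return c1, conf

def self_training(xs_lab, ys_lab, xs_unlab, threshold=0.75, max_rounds=10):
    xs_lab, ys_lab, pool = list(xs_lab), list(ys_lab), list(xs_unlab)
    for _ in range(max_rounds):
        model = fit_centroids(xs_lab, ys_lab)
        scored = [(x,) + predict(model, x) for x in pool]
        # Only confident predictions become pseudo-labels (limits error amplification).
        accepted = [(x, y) for x, y, conf in scored if conf >= threshold]
        if not accepted:
            break
        for x, y in accepted:
            xs_lab.append(x)
            ys_lab.append(y)
        pool = [x for x, _, conf in scored if conf < threshold]
    return fit_centroids(xs_lab, ys_lab), pool

model, remaining = self_training([0.0, 10.0], [0, 1], [1.0, 2.0, 9.0, 8.0, 5.2])
```

In this toy run, the ambiguous sample `5.2` never passes the confidence threshold and stays unlabeled, while the four other samples get pseudo-labels and shift the class centroids.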

Weak supervision

More general setting: same as above, but in addition the labels may be noisy
(cf next section)

Self-supervision

Unsupervised manner to pre-train a model, with an ad-hoc supervised task designed so that labels are directly provided by the data itself
Ex:

- for an image classification task:
  - image puzzle: cut each image into 9 pieces (3x3 grid), then take two pieces randomly (of the same image), and ask their relative location
    [Unsupervised Visual Representation Learning by Context Prediction; Carl Doersch, Abhinav Gupta, Alexei A. Efros; ICCV 2015]
  - rotate images by random angles and ask, for each given rotated image, to retrieve that angle
    [Unsupervised Representation Learning by Predicting Image Rotations; Spyros Gidaris, Praveer Singh, Nikos Komodakis; ICLR 2018]
  - define many new very specific classes as one per sample in the training set; add random transformations to images and ask which were the original images in the training set, i.e. predict the specific class
    [Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks; Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, Thomas Brox; NIPS 2014]
- for a video classification task:
  - predict next video frame, or, weaker: can this frame lie between these 2?
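
As an illustration of how such pretext tasks get their labels for free, here is a minimal sketch of the rotation task. It assumes images stored as nested lists and restricts angles to multiples of 90 degrees so that no interpolation is needed (an assumption of this sketch); the function names are hypothetical.

```python
import random

def rot90(img):
    """Rotate an H x W nested-list image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_sample(img, rng):
    """Return (rotated image, label), the label being the number of quarter turns."""
    k = rng.randrange(4)
    out = [row[:] for row in img]
    for _ in range(k):
        out = rot90(out)
    return out, k

rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6]]
rotated, label = make_rotation_sample(image, rng)
```

The pair `(rotated, label)` is then used as an ordinary supervised sample: the network sees `rotated` and is trained to classify `label`, without any human annotation.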

Adding teacher-student approaches on top:

[Clusterfit: Improving Generalization of Visual Representations; Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan; CVPR 2020] : pre-trained network (for a dummy self-supervised task) $\rightarrow$ get features from it (activities in a certain layer) $\rightarrow$ cluster $\rightarrow$ make fake labels $\rightarrow$ train new network for that. More robust to features specific to the dummy task.
DINO, from [Emerging Properties in Self-Supervised Vision Transformers; Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin; ICCV 2021] : consider a randomly initialized network (the student), and a copy of it (the teacher). During training, the teacher weights are a moving average of the student weights (i.e. of its weights at the last time steps t, t-1, t-2...). The teacher is thus not trained by gradient descent at all. Each sample is augmented (rotations, etc.) to produce several "views"; the student is applied to (cropped versions of) such augmented data and is asked to reproduce the teacher output on (another, non-cropped view of) the same sample. Thus the student tries to be invariant to the noise due to augmentation, and tries to infer global properties from more local ones (infer the output for a full, non-cropped sample when seeing a cropped version of it). Additional mechanisms are needed to avoid model collapse (e.g., always outputting 0). This unsupervised pretraining, followed by the training of just a linear classifier or k-NN, yields performance on ImageNet classification close to that of supervised training.
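
The moving-average teacher can be sketched as below. This is only an illustration of the update rule, with weights stored in a plain dictionary; the `momentum` values are hypothetical, and the rest of the method (views, loss, collapse avoidance) is not shown.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight slightly towards the corresponding student weight.

    teacher, student: dicts mapping parameter names to float values.
    The teacher receives no gradient: it only tracks an exponential moving
    average of the successive student weights.
    """
    for name, w_student in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * w_student
    return teacher

# Toy run: the teacher starts as a copy of the student (w = 1.0),
# then the student weight moves to 0.0 and stays there for two steps.
teacher = {"w": 1.0}
for student_w in (0.0, 0.0):
    ema_update(teacher, {"w": student_w}, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and averages the student over many past steps, which gives a more stable target for the student to imitate.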

Active learning

Same setting as semi-supervision, except that one can ask for some samples to be labeled. This is costly, so one would like to train the model to reach some target accuracy with as few labeled samples as possible. The question is then, iteratively, to pick the right examples to label, i.e. those that will improve the model the most.
Formally:

a large dataset $(x_1, x_2... x_n)$ is given
with labels known only for a small quantity of samples: $(y_1, y_2... y_p)$ with $p \ll n$
which $x_i$ (i.e. which $i \in ]p, n]$) to pick and ask a label for?
one can rely on the predictions of the current model for all unlabeled samples: $\hat{y}_i$, which, in a classification task case, are probability vectors for each sample $i$: $\hat{y}_i = (\hat{y}^c_i)_{c \in C}$ where $C$ is the set of classes.

Example of methods:$\DeclareMathOperator*{\argmax}{arg\,max}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

Local methods

(quantifying the impact of the choice over the sample chosen only)

Uncertainty sampling:
pick the sample $x_i$ for which the model is the most uncertain, i.e. with lowest prediction confidence: $$\argmin_i\;\; \sup_{c \in C}\;\; \hat{y}^c_i$$
Margin sampling:
quantify uncertainty by the difference between the two highest class probabilities: $$\argmin_i\;\; \hat{y}^{c1}_i - \hat{y}^{c2}_i $$ where $c1$ and $c2$ are the two most probable classes for sample $i$, i.e. $c1 = \argmax_c \hat{y}^{c}_i\;$ and $\;c2 = \argmax_{c \neq c1} \hat{y}^{c}_i$
Entropy sampling:
entropy is maximum when the probability distribution $\hat{y}^c_i$ over classes $c$ is uniform, i.e. most uncertain to pick a class; while minimum when a Dirac peak on a class, i.e. most certain
$$\argmax_i\;\; H(\hat{y}^{c}_i) \;\;\;\text{ where } H \text{ is the entropy: } \;\; H(\hat{y}^{c}_i) = - \sum_c \hat{y}^{c}_i \log \hat{y}^{c}_i $$
Query by committee
Suppose your model is actually an ensemble of $K$ models $m_k$ (for $k \in [1, K]$), trained on the same data, but possibly making different predictions $y_{i,k}$.
A possible way to quantify uncertainty for one sample $x_i$ is to check how much the predictions $y_{i,k}$ differ for the different models.
Then, select the point $x_i$ for which the models disagree the most.
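
The three probability-based criteria above (uncertainty, margin, entropy) can be compared on a toy example. This is a minimal sketch: the predicted probability vectors in `preds` are made up for illustration, and the helper names are not from any library.

```python
import math

def least_confidence(p):
    """Highest class probability (to be minimized over samples)."""
    return max(p)

def margin(p):
    """Gap between the two most probable classes (to be minimized over samples)."""
    first, second = sorted(p, reverse=True)[:2]
    return first - second

def entropy(p):
    """Shannon entropy of the prediction (to be maximized over samples)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def argmin(values):
    return min(range(len(values)), key=values.__getitem__)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical predictions for 3 unlabeled samples and 3 classes.
preds = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.50, 0.00],
]
pick_uncertainty = argmin([least_confidence(p) for p in preds])
pick_margin = argmin([margin(p) for p in preds])
pick_entropy = argmax([entropy(p) for p in preds])
```

On this example the criteria do not agree: uncertainty and entropy sampling select sample 1 (probabilities spread over all classes), while margin sampling selects sample 2 (exact tie between the two top classes).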

Global methods

(quantifying the impact of the choice over all dataset samples)

Expected model change
Which new labeled sample would change the current model the most, if doing one supplementary gradient descent step on the parameters $\theta$ of the model to minimize its prediction error?
$\implies$ which sample would induce the biggest variation of model parameters?
$\rightarrow$ approximated as the norm of the gradient of the loss for that sample
with expectation over probabilities of what that label could be:
$$\argmax_i\;\; \sum_c \; \hat{y}^{c}_i \; \|\nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)\|$$ noting that $\hat{y}^c_i$ depends on $\theta$, and denoting by $\delta_c$ the Dirac peak on class $c$.
Expected error or uncertainty reduction
How much is the prediction error (or uncertainty) over all samples reduced if we re-train with that additional label?
$$\argmin_i\;\; \sum_{c \in C} \; \hat{y}^c_i \; \sum_j \text{prediction error for } x_j \text{ if trained with } (x_i, y^c_i) \text{ also}$$ where the error can be measured with any of the previous ways to quantify uncertainty or error.
Either retrain the model, or use the gradient of the prediction error, e.g. $\nabla_\theta \, \text{Loss}(\hat{y}^{c'}_j, \delta_{c'})$, averaged over possible classes $c'$ for $x_j$, averaged over samples $x_j$, and multiplied by the parameter shift induced by the new sample, $-\nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)$: $$\argmax_i\;\; \sum_{c \in C} \; \hat{y}^c_i \; \sum_j \; \sum_{c' \in C} \hat{y}^{c'}_j \; \nabla_\theta \, \text{Loss}(\hat{y}^{c'}_j, \delta_{c'}) \;\cdot\; \nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)$$ $\implies$ lots of computations but still linear in the dataset size
NB: this is precisely the notion of similarity we will study in the noisy data section
Density-Weighted Methods
Searching for samples representative of many others (whereas, conversely, the most uncertain sample might be irrelevant to the other samples)
$$\argmax_i\;\; [\text{information brought by } (x_i, y_i) ] \times \sum_j \text{similarity}(x_i,x_j)$$ where "information brought" denotes any of the previous methods (to quantify uncertainty or error at $x_i$) and "similarity" is defined appropriately
NB: with the right choices, this can actually boil down to the same formula as the previous method.
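
To make the "expected model change" criterion above concrete, here is a minimal sketch under a strong simplifying assumption: the model is a linear softmax classifier (logits $= Wx$, no bias) trained with cross-entropy. In that case the gradient of the loss with respect to $W$ for a candidate label $c$ is $(\hat{y}_i - \delta_c)\, x_i^\top$, whose norm is $\|\hat{y}_i - \delta_c\| \, \|x_i\|$, so no automatic differentiation is needed. The weights and inputs below are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def expected_gradient_length(W, x):
    """Sum over classes c of p_c * ||grad_W Loss(x, c)|| for a linear softmax model."""
    logits = [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]
    p = softmax(logits)
    x_norm = math.sqrt(sum(v * v for v in x))
    score = 0.0
    for c, p_c in enumerate(p):
        # Residual p - delta_c: the gradient w.r.t. W is its outer product with x.
        resid = [p_k - (1.0 if k == c else 0.0) for k, p_k in enumerate(p)]
        score += p_c * math.sqrt(sum(r * r for r in resid)) * x_norm
    return score

W = [[1.0, 0.0], [-1.0, 0.0]]    # hypothetical weights, 2 classes, 2 input features
score_ambiguous = expected_gradient_length(W, [0.0, 1.0])  # both classes equally likely
score_confident = expected_gradient_length(W, [3.0, 0.0])  # class 0 strongly favored
```

The ambiguous input gets the larger score: whatever its label turns out to be, the gradient step would be sizable, whereas for the confidently classified input the most likely label produces almost no change.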

III - Noisy data

Denoising auto-encoder

Dealing with noisy data can sometimes be seen as noise modeling.
[Extracting and composing robust features with denoising autoencoders; P. Vincent, H. Larochelle, Y. Bengio and P.A. Manzagol; ICML 2008]

originally meant to make auto-encoders more robust (without requiring a narrow bottleneck, i.e. few middle neurons)
get noisy inputs (in the article, artificial additional noise), ask for reconstruction without noise
learns to get robust features and to denoise

Classification with noisy labels

What if some samples are mislabeled in the ground truth? $\rightarrow$ not much change of accuracy
What if a significant proportion of samples are mislabeled? $\rightarrow$ still possible to train and get good results!
What if most samples (90% or 99%) are mislabeled? $\rightarrow$ still possible to get reasonable results! provided data is available in large quantities (e.g., 10 times more data if 10% well labeled in the ground truth, 100 times more if 1%...)
[Deep Learning is Robust to Massive Label Noise; David Rolnick, Andreas Veit, Serge Belongie, Nir Shavit; 2017]

Regression with noisy labels

What if noise is already present in the dataset, but unknown? (no denoised target available)

data self-realignment : possible?
example: registering RGB satellite images to cadaster maps (binary pictures indicating building presence): no exact ground truth available (spurious deformations in data acquisition & elevation landscape / human mistakes in making cadasters)
- [Aligning and Updating Cadaster Maps with Aerial Images by Multi-Task, Multi-Resolution Deep Learning; Nicolas Girard, Guillaume Charpiat, Yuliya Tarabalka; ACCV 2018]
- [Noisy Supervision For Correcting Misaligned Cadaster Maps Without Perfect Ground Truth Data; Nicolas Girard, Guillaume Charpiat, Yuliya Tarabalka; 2019]
in the case of a regression task : $x \mapsto y$ with noisy dataset samples $(x,y)$ :
- if output noise (on $y$) is ~i.i.d. (does not depend on x)
  or more exactly: if "relatively similar" $x, x'$, which should have the same true label, have independent noises on two labels $y, y'$
  then noise is unbiased
- Extreme case of only one possible input $x$, seen many times with different noisy labels $y_i$:
  if loss = $L_2$ metric : $\sum_i \| \hat{y}(x) - y_i \|^2$ with $y_i = y_{\text{real}} + \text{noise}_i$ (unknown noise)
  then the optimal $\hat{y}(x)$ is reached at the mean of the $y_i$, i.e. $y_{\text{real}} \pm O\left(\frac{1}{\sqrt{\text{number of samples}}}\right)$
  [Noise2Noise: Learning Image Restoration without Clean Data; Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, Timo Aila; 2018]
- General case of different inputs: requires a notion of similarity between any $x$ and $x'$
  $\rightarrow$ defined and studied in [Input similarity from the neural network perspective; G. Charpiat, N. Girard, L. Felardos and Y. Tarabalka; NeurIPS 2019]
  $\implies$ similarity between $x$ and $x'$: based on $\nabla_\theta\, \hat{y}(x) \,\cdot\, \nabla_\theta\, \hat{y}(x')$ (how much a variation of parameters meant to change the labels of point $x$ would affect the labels of point $x'$ too) $\implies$ gives the expected magnitude of the denoising effect for a given trained neural network, without knowing true labels
  $\implies$ linked to the Neural Tangent Kernel (NTK)
- in practice: denoise the dataset this way (here, re-align) and then re-learn from the aligned dataset (as less noise $\implies$ better results [theoretically same global optimum, but with different confidence (= variance)])
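
The extreme case above (a single input seen many times with different noisy labels) can be checked numerically. This is a minimal sketch, assuming scalar labels and zero-mean Gaussian noise of standard deviation 1 (both assumptions of this example): setting the derivative of $\sum_i (\hat{y} - y_i)^2$ to zero gives $\hat{y} = \frac{1}{n}\sum_i y_i$, and the distance of that mean to the true label shrinks as the number of samples grows.

```python
import random

def l2_loss(y_hat, ys):
    """Sum of squared errors between a constant prediction and the noisy labels."""
    return sum((y_hat - y) ** 2 for y in ys)

def best_constant_prediction(ys):
    """Minimizer of l2_loss over y_hat: the mean of the noisy labels."""
    return sum(ys) / len(ys)

rng = random.Random(0)
y_real = 3.0   # true label, never observed directly
errors = {}
for n in (100, 10000):
    # Each observation is the true label plus independent zero-mean noise.
    ys = [y_real + rng.gauss(0.0, 1.0) for _ in range(n)]
    y_hat = best_constant_prediction(ys)
    errors[n] = abs(y_hat - y_real)
```

`errors[n]` is expected to be of the order of $1/\sqrt{n}$, i.e. roughly 0.1 for 100 samples and 0.01 for 10000 samples, even though no clean label was ever used.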
