Classical setting: supervised training of neural networks $\implies$ requires large quantities of labeled data
What about small data? big data with few labeled samples? big but noisy data?
I - Small data
Data augmentation
Add various transformations to real data (possibly random, not applying exactly the same transformation twice to the same sample, so as not to learn biases due to the transformations)
eg: for image classification tasks: rotations, flips, contrast, color balance, noise...
or also: smooth deformations (replacing image $A$ with $A \circ f$)
take 2 samples, copy-paste a rectangular part of one over the other, and train with a linear combination of the 2 losses (one per original label), weighted by area ratio, or ask the model to predict a linear combination of the 2 targets (also weighted by area) (sketch below)
$\implies$ learn to ground decision confidence on evidence proportions (size of image part that contains the object)
or use a simulator
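A minimal PyTorch sketch of the copy-paste augmentation above (loss-mixing variant); the batch pairing, rectangle sampling and names are illustrative:
```python
import torch
import torch.nn.functional as F

def cutmix_batch(x, y, model):
    """Copy-paste a random rectangle from a shuffled batch into x,
    then weight the two losses by the area ratio of the pasted patch."""
    B, C, H, W = x.shape
    perm = torch.randperm(B)                                   # partner sample for each image
    h = torch.randint(1, H, (1,)).item()                       # random rectangle size
    w = torch.randint(1, W, (1,)).item()
    top = torch.randint(0, H - h + 1, (1,)).item()             # random rectangle position
    left = torch.randint(0, W - w + 1, (1,)).item()
    x_mix = x.clone()
    x_mix[:, :, top:top+h, left:left+w] = x[perm, :, top:top+h, left:left+w]
    lam = 1.0 - (h * w) / (H * W)                              # area kept from the original image
    logits = model(x_mix)
    # linear combination of the 2 losses, weighted by area ratio
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])
```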
Multi-tasking
Hope that features useful for another, related task on the same data will also be useful for the task of interest
If the tasks are too unrelated: training both with the same network decreases result quality (the tasks fight each other for relevant features)
If the tasks are related enough: accuracy increases on the actual target task (some features learned for the auxiliary task are useful for it too)
Example: on a satellite image to cadaster map registration task, also learn to segment the images $\implies$ both tasks reach better accuracy when trained together
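A minimal sketch of the multi-task setup: one shared backbone, one head per task, and a weighted sum of losses (module names and the weighting are illustrative, not those of the cited registration work):
```python
import torch.nn as nn

class TwoTaskNet(nn.Module):
    """Shared features, two task-specific heads (e.g., main task + auxiliary segmentation)."""
    def __init__(self, backbone, head_main, head_aux):
        super().__init__()
        self.backbone, self.head_main, self.head_aux = backbone, head_main, head_aux

    def forward(self, x):
        feats = self.backbone(x)                 # features shared by both tasks
        return self.head_main(feats), self.head_aux(feats)

def multitask_loss(out_main, out_aux, y_main, y_aux, loss_main, loss_aux, w_aux=0.5):
    # weighted sum of the two task losses; w_aux controls how much the auxiliary task shapes the shared features
    return loss_main(out_main, y_main) + w_aux * loss_aux(out_aux, y_aux)
```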
Transfer learning
Train first (or simultaneously) on another (bigger, or with more available labels) dataset for the same task (or a similar one)
Ex: for a medical image segmentation task, with few scans available:
pick VGG already trained on ImageNet,
replace the last layers by your own,
train them for your own task on your dataset
$\Leftrightarrow$ considering the second-to-last layer of VGG as a feature extractor
$\implies$ surprisingly, such seemingly different tasks are sometimes close enough for this to work (both require developing features for image analysis)
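A minimal sketch of the recipe above with torchvision (the `weights` argument value depends on the torchvision version): freeze the pretrained feature extractor, replace the last layers by a new head, and train only that head on the small dataset:
```python
import torch.nn as nn
from torchvision import models

# VGG16 pretrained on ImageNet (argument name/value may differ across torchvision versions)
vgg = models.vgg16(weights="IMAGENET1K_V1")

# freeze the pretrained convolutional features: treat their output as fixed features
for p in vgg.features.parameters():
    p.requires_grad = False

# replace the last layers by our own head, sized for our task (n_classes is illustrative)
n_classes = 4
vgg.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 256), nn.ReLU(),
    nn.Linear(256, n_classes),
)
# then train only vgg.classifier on the small dataset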
Expected effect of transfer learning:
if very small data, big help in getting relevant features
LoRA (Low-Rank Adaptation): do not fine-tune the whole weight matrices (i.e. not all coefficients) but just add trainable low-rank matrices to them (sketch below)
case of LLMs: learn a new token, inserted before all sentences for this task
prompting
RAG (retrieval-augmented generation)
Note that training an LLM is very costly, but fine-tuning one is cheap, both in number of samples and in time/energy.
NB: fine-tuning just a few parameters is faster + prevents overfitting
Issue of transfer learning from large generic models: need to compress the obtained model to reduce energy and time costs
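Sketch of the LoRA idea mentioned above: wrap a frozen linear layer and add a trainable low-rank update (rank, scaling and initialization choices are illustrative):
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W, plus a trainable low-rank update B @ A with rank r << min(d_in, d_out)."""
    def __init__(self, pretrained_linear: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = pretrained_linear
        for p in self.base.parameters():                      # the full matrix is not tuned
            p.requires_grad = False
        d_out, d_in = self.base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # only these low-rank factors are trained
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init => starts exactly as the pretrained model
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```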
II - Few labeled examples: forms of weak supervision
Semi-supervision
Case where some samples (generally: few) are labeled, but many other, unlabeled samples are available.
This occurs when labeling is costly (e.g., requires expertise, or time...).
Examples of techniques:
common unsupervised learning for all data (get clusters, features...) then train in a supervised manner;
train in a supervised manner, use the model to label some of the unlabeled samples, and iterate (issue: mistakes might be made and get amplified over the iterations) (sketch below);
train in a supervised manner but check classifier properties on the whole dataset (bias, density, margin...)
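A sketch of the second technique above (self-training with pseudo-labels), assuming a PyTorch classifier and a loader over unlabeled inputs; the confidence threshold is an illustrative hyperparameter:
```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.95):
    """Return (inputs, pseudo-labels) for the unlabeled samples predicted with high confidence."""
    model.eval()
    xs, ys = [], []
    for x in unlabeled_loader:
        probs = torch.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf > threshold          # only trust confident predictions (mistakes get amplified over iterations)
        xs.append(x[keep]); ys.append(pred[keep])
    return torch.cat(xs), torch.cat(ys)

# iterate: train on the labeled data, add the pseudo-labeled samples, retrain, ...
```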
Weak supervision
More general setting: same as above, but in addition the labels may be noisy
(cf next section)
Self-supervision
An unsupervised way to pre-train a model, using an ad-hoc supervised task designed so that the labels are provided directly by the data itself
Ex:
DINO, from [Emerging Properties in Self-Supervised Vision Transformers; Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin; ICCV 2021] : consider a randomly initialized network (the student), and a copy of it (the teacher). During training, the teacher weights will be a moving average of the student weights (i.e. its weights at the last time steps t, t-1, t-2...). The teacher will thus not be trained at all. Each sample is augmented (rotations, etc.) to produce several "views"; the student is applied to (cropped versions of) such augmented data and is asked to try to reproduce the teacher output on (another, non-cropped view of) the same sample. Thus the student tries to be invariant to the noise due to augmentation, and tries to infer global properties from more local ones (infer the output for a full, non-cropped sample when seeing a cropped version of it). Supplementary considerations are needed to avoid model collapse (e.g., just outputting always 0). This unsupervised pretraining, followed by the training of just a linear classifier or k-NN, yields performance on ImageNet classification close to supervised training.
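A much-simplified sketch of the two key ingredients (teacher as moving average of the student, student matching the teacher's output on another view); the centering/sharpening tricks against collapse and the multi-crop handling are omitted, and the temperatures and momentum are illustrative:
```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """Teacher weights = exponential moving average of the student weights (the teacher itself is never trained)."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_((1 - momentum) * ps)

def self_distillation_loss(student, teacher, global_view, local_view, temp_s=0.1, temp_t=0.04):
    """The student sees a cropped/augmented view and must reproduce the teacher's output on a full view
    of the same sample (collapse-prevention details omitted)."""
    with torch.no_grad():
        t = torch.softmax(teacher(global_view) / temp_t, dim=1)
    s = torch.log_softmax(student(local_view) / temp_s, dim=1)
    return -(t * s).sum(dim=1).mean()   # cross-entropy between teacher and student distributions
```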
Active learning
Same setting as semi-supervision, except that one can ask to label some samples. This is costly, so one would like to train the model to reach some target accuracy with as few samples as possible.
The question is then, iteratively, to pick the right examples to label, that will improve the model the most.
Formally:
a large dataset $(x_1, x_2... x_n)$ is given
with labels known only for a small quantity of samples: $(y_1, y_2... y_p)$ with $p \ll n$
which $x_i$ (i.e. which $i \in ]p, n]$) to pick and ask to label?
one can rely on the current model's predictions for all unlabeled samples: $\hat{y}_i$, which, in a classification task, are probability vectors for each sample $i$: $\hat{y}_i = (\hat{y}^c_i)_{c \in C}$ where $C$ is the set of classes.
Examples of methods:$\DeclareMathOperator*{\argmax}{arg\,max}$ $\DeclareMathOperator*{\argmin}{arg\,min}$
Local methods
(quantifying the impact of the choice over the sample chosen only)
Uncertainty sampling:
pick the sample $x_i$ for which the model is the most uncertain, i.e. with lowest prediction confidence:
$$\argmin_i\;\; \sup_{c \in C}\;\; \hat{y}^c_i$$
Margin sampling:
quantify uncertainty by the difference between the two highest class probabilities:
$$\argmin_i\;\; \hat{y}^{c1}_i - \hat{y}^{c2}_i $$
where $c1$ and $c2$ are the two most probable classes for sample $i$, i.e. $c1 = \argmax_c \hat{y}^{c}_i\;$ and $\;c2 = \argmax_{c \neq c1} \hat{y}^{c}_i$
Entropy sampling:
entropy is maximal when the probability distribution $\hat{y}^c_i$ over classes $c$ is uniform, i.e. when the choice of class is most uncertain; and minimal when it is a Dirac peak on a single class, i.e. most certain
$$\argmax_i\;\; H(\hat{y}^{c}_i) \;\;\;\text{ where } H \text{ is the entropy: } \;\; H(\hat{y}^{c}_i) = - \sum_c \hat{y}^{c}_i \log \hat{y}^{c}_i $$
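These three criteria only require the predicted class probabilities; a sketch assuming `probs` is an (n_samples, n_classes) NumPy array of the $\hat{y}^c_i$:
```python
import numpy as np

def least_confident(probs):
    """Uncertainty sampling: pick the sample whose top class probability is lowest."""
    return np.argmin(probs.max(axis=1))

def margin_sampling(probs):
    """Pick the sample with the smallest gap between the two most probable classes."""
    top2 = np.sort(probs, axis=1)[:, -2:]          # two highest probabilities per sample
    return np.argmin(top2[:, 1] - top2[:, 0])

def entropy_sampling(probs, eps=1e-12):
    """Pick the sample whose predicted distribution has maximal entropy."""
    H = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argmax(H)
```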
Query by committee
Suppose your model is actually an ensemble of $K$ models $m_k$ (for $k \in [1, K]$),
trained on the same data, but possibly making different predictions $\hat{y}_{i,k}$.
A possible way to quantify uncertainty for one sample $x_i$ is to check how much the predictions $\hat{y}_{i,k}$ differ across the models.
Then, select the point $x_i$ for which the models disagree the most.
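A sketch using vote entropy over the committee's predicted classes as the disagreement measure (one common choice among several; names are illustrative):
```python
import numpy as np

def query_by_committee(committee_preds, n_classes, eps=1e-12):
    """committee_preds: (K, n_unlabeled) array of predicted class indices, one row per model.
    Returns the index of the sample on which the committee disagrees the most (vote entropy)."""
    vote_entropy = np.zeros(committee_preds.shape[1])
    for c in range(n_classes):
        frac = (committee_preds == c).mean(axis=0)     # fraction of models voting for class c
        vote_entropy -= frac * np.log(frac + eps)
    return np.argmax(vote_entropy)
```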
Global methods
(quantifying the impact of the choice over all dataset samples)
Expected model change
Which new labeled sample would change the current model the most, if doing one supplementary gradient descent step on the parameters $\theta$ of the model to minimize its prediction error?
$\implies$ which sample would induce the biggest variation of model parameters?
$\rightarrow$ approximated as the norm of the gradient of the loss for that sample
with expectation over probabilities of what that label could be:
$$\argmax_i\;\; \sum_c \; \hat{y}^{c}_i \; \|\nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)\|$$
noting that $\hat{y}^c_i$ depends on $\theta$, and denoting by $\delta_c$ the Dirac peak on class $c$.
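A sketch of this criterion, assuming a PyTorch classifier and an iterable of unlabeled sample tensors; per-sample gradients are computed naively (one backward per class), which is slow but follows the formula directly:
```python
import torch
import torch.nn.functional as F

def expected_model_change(model, x_unlabeled):
    """For each unlabeled sample: expected gradient norm over its possible labels,
    weighted by the current predicted probabilities; return the argmax."""
    params = [p for p in model.parameters() if p.requires_grad]
    scores = []
    for x in x_unlabeled:                                       # one sample at a time (per-sample gradients)
        logits = model(x.unsqueeze(0))
        probs = torch.softmax(logits, dim=1).squeeze(0)
        score = 0.0
        for c in range(probs.numel()):
            loss = F.cross_entropy(logits, torch.tensor([c]))   # Loss(y_hat_i, delta_c)
            grads = torch.autograd.grad(loss, params, retain_graph=True)
            gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            score += probs[c].item() * gnorm.item()
        scores.append(score)
    return int(torch.tensor(scores).argmax())
```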
Expected error or uncertainty reduction
How much is the prediction error (or uncertainty) over all samples reduced if we re-train with that supplementary label?
$$\argmin_i\;\; \sum_{c \in C} \; \hat{y}^c_i \; \sum_j \text{prediction error for } x_j \text{ if also trained with } (x_i, y_i = c)$$
(any of the previously seen ways to quantify uncertainty or error can be used)
Retrain the model, or estimate the effect with gradients: take the gradient of the prediction error, i.e. e.g. $\nabla_\theta \, \text{Loss}(\hat{y}^{c'}_j, \delta_{c'})$, averaged over the possible classes $c'$ for $x_j$ and over the samples $x_j$, and multiply it by the parameter shift induced by the new sample, $-\nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)$:
$$\argmax_i\;\; \sum_{c \in C} \; \hat{y}^c_i \; \sum_j \; \sum_{c' \in C} \hat{y}^{c'}_j \; \nabla_\theta \, \text{Loss}(\hat{y}^{c'}_j, \delta_{c'}) \;\cdot\; \nabla_\theta \, \text{Loss}(\hat{y}^c_i, \delta_c)$$
$\implies$ lots of computations but still linear in the dataset size
NB: this is precisely the notion of similarity we will study in the noisy data section
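A sketch of this gradient-dot-product approximation: the expected loss gradient of each sample is computed once and the sum over $j$ is precomputed, so the cost stays linear in the number of samples (names are illustrative; here the sum over $j$ runs over the given pool of samples):
```python
import torch
import torch.nn.functional as F

def expected_loss_grad(model, x, params):
    """Expected gradient of the loss at x over its possible labels, weighted by the predicted probabilities."""
    logits = model(x.unsqueeze(0))
    probs = torch.softmax(logits, dim=1).squeeze(0)
    g = [torch.zeros_like(p) for p in params]
    for c in range(probs.numel()):
        grads = torch.autograd.grad(F.cross_entropy(logits, torch.tensor([c])), params, retain_graph=True)
        g = [gi + probs[c].item() * gc for gi, gc in zip(g, grads)]
    return g

def expected_error_reduction_scores(model, x_pool):
    """Score_i = (sum over all samples j of expected loss gradients) . (expected loss gradient of sample i);
    a first-order estimate of how much labeling x_i would reduce the error on the whole pool."""
    params = [p for p in model.parameters() if p.requires_grad]
    per_sample = [expected_loss_grad(model, x, params) for x in x_pool]
    total = [sum(gs) for gs in zip(*per_sample)]                  # precomputed sum over j
    return [sum((gt * gi).sum() for gt, gi in zip(total, g_i)).item() for g_i in per_sample]
```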
Density-Weighted Methods
Search for samples representative of many other ones (whereas, on the contrary, the most uncertain samples might be outliers, irrelevant to the other samples)
$$\argmax_i\;\; [\text{information brought by } (x_i, y_i) ] \times \sum_j \text{similarity}(x_i,x_j)$$
where "information brought" denotes any of the previous methods (to quantity uncertainty or error at $x_i$) and "similarity" is defined appropriately
NB: with the right choices, this can actually boil down to the same formula as the previous method.
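A sketch of the density weighting, assuming cosine similarity between some feature vectors as one possible choice of similarity (both inputs are illustrative):
```python
import numpy as np

def density_weighted_scores(info, features):
    """info: (n,) informativeness scores (e.g., entropy of the prediction);
    features: (n, d) representations used to define similarity (here cosine similarity)."""
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T                         # (n, n) pairwise cosine similarities
    density = sim.sum(axis=1)                   # how representative each sample is of the rest
    return info * density                       # pick the argmax of this
```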
III - Noisy data
Denoising auto-encoders
originally meant to robustify auto-encoders (without requiring a bottleneck of few middle neurons)
feed noisy inputs (in the original article, artificial additional noise) and ask for reconstruction of the clean input
learns robust features and learns to denoise
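A minimal sketch of one training step, assuming Gaussian corruption and a mean-squared reconstruction loss (encoder/decoder modules and the noise level are illustrative):
```python
import torch
import torch.nn as nn

def denoising_ae_step(encoder, decoder, x, optimizer, noise_std=0.1):
    """Corrupt the input with artificial noise, ask for reconstruction of the clean input."""
    x_noisy = x + noise_std * torch.randn_like(x)
    recon = decoder(encoder(x_noisy))
    loss = nn.functional.mse_loss(recon, x)     # target is the clean sample, not the noisy one
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```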
Classification with noisy labels
What if some samples are mislabeled in the ground truth? $\rightarrow$ not much change in accuracy
What if a significant proportion of samples are mislabeled? $\rightarrow$ still possible to train and get good results!
What if most samples (90% or 99%) are mislabeled? $\rightarrow$ still possible to get reasonable results, provided data is available in large quantities (e.g., 10 times more data if only 10% is well labeled in the ground truth, 100 times more if 1%...)
[Deep Learning is Robust to Massive Label Noise; David Rolnick, Andreas Veit, Serge Belongie, Nir Shavit; 2017]
Regression with noisy labels
What if noise is already present in the dataset, but unknown? (no denoised target available)
data self-realignment: possible?
example: registering RGB satellite images to cadaster maps (binary pictures indicating building presence): no exact ground truth available (spurious deformations in data acquisition & elevation landscape / human mistakes in making cadasters)
in the case of a regression task $x \mapsto y$ with noisy dataset samples $(x,y)$:
if output noise (on $y$) is ~i.i.d. (does not depend on x)
or more exactly: if "relatively similar" $x, x'$, which should have the same true label, have independent noises on two labels $y, y'$
then the noise averages out: the network tends to predict the mean of the noisy labels of similar inputs, i.e. an unbiased estimate of the true label (denoising effect)
General case of different inputs: requires a notion of similarity between any $x$ and $x'$
$\rightarrow$ defined and studied in [Input similarity from the neural network perspective; G. Charpiat, N. Girard, L. Felardos and Y. Tarabalka; NeurIPS 2019]
$\implies$ similarity between $x$ and $x'$: based on $\nabla_\theta\, \hat{y}(x) \,\cdot\, \nabla_\theta\, \hat{y}(x')$ (how much a variation of parameters meant to change the labels of point $x$ would affect the labels of point $x'$ too)
$\implies$ gives the expected magnitude of the denoising effect for a given trained neural network, without knowing true labels
$\implies$ linked to the Neural Tangent Kernel (NTK)
in practice: denoise the dataset this way (here, re-align) and then re-learn from the aligned dataset (as less noise $\implies$ better results [theoretically same global optimum, but with different confidence (= variance)])
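A sketch of this gradient-based similarity for a scalar-output (regression) model, assuming PyTorch; this is the raw dot product, which can be normalized (e.g., cosine-style) if desired:
```python
import torch

def input_similarity(model, x1, x2):
    """Similarity of two inputs as the dot product of the parameter-gradients of the model outputs,
    i.e., how much a parameter update meant to change the prediction at x1 also moves the prediction at x2."""
    params = [p for p in model.parameters() if p.requires_grad]
    g1 = torch.autograd.grad(model(x1.unsqueeze(0)).sum(), params)
    g2 = torch.autograd.grad(model(x2.unsqueeze(0)).sum(), params)
    return sum((a * b).sum() for a, b in zip(g1, g2)).item()
```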