Chapter 1: Deep learning vs. classical machine learning and optimization
No guarantee (beforehand) that the training (with such architecture, criterion to optimize) will lead to a good solution
Uselessness of universal approximation theorems (expressive power)
Main universal approximation theorems:
[George Cybenko, 1989; Kurt Hornik, 1991; Wikipedia summary]: one hidden layer is sufficient to approximate any continuous function (defined on a compact set such as $[0,1]^d$) arbitrarily precisely, provided it has enough neurons (for any continuous, non-constant activation function)
+ an exact representation theorem by Kolmogorov [1956] (exact decomposition of any continuous function with a finite number of neurons with specific activation functions): but the construction relies on extremely irregular, hashing-like inner functions
These theorems are about the possibility of learning a function "by heart", given enough samples (i.e. a dense sampling of all possible inputs), but they say nothing about generalization to new samples (learnability of a function?)
Example: one would need an infinite number of samples to learn a periodic function on $\mathbb{R}$ (no generalization ability); see the sketch below
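A minimal sketch (assuming PyTorch; the architecture and hyperparameters are arbitrary choices for illustration) of both points: one hidden layer is enough to fit $\sin(2\pi x)$ on $[0,1]$ from a dense sampling, yet the fit says nothing about the behaviour outside the sampled region:
```python
# Hypothetical illustration (PyTorch assumed): a one-hidden-layer network fits
# sin(2*pi*x) well on the training interval [0, 1], but fails to extrapolate
# to [1, 2]: approximation power says nothing about generalization.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x_train = torch.rand(2000, 1)                      # dense sampling of [0, 1]
y_train = torch.sin(2 * math.pi * x_train)

model = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = ((model(x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()

x_out = torch.rand(2000, 1) + 1.0                  # points in [1, 2], never seen
with torch.no_grad():
    err_in  = ((model(x_train) - y_train) ** 2).mean()
    err_out = ((model(x_out) - torch.sin(2 * math.pi * x_out)) ** 2).mean()
print(err_in.item(), err_out.item())               # typically: small vs. large
```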
Depth simplifies the approximation/estimation task:
Why deep: examples of depth vs. layer size compromises with explicit bounds
technical preamble: a multiplication can be approximated with 4 neurons (computing weighted sums followed by the activation), provided the activation function is smooth (admits a Taylor expansion); see the sketch after this list
then: multiplication of $n$ variables is possible with a binary tree of depth $\log n$, with $4 n$ nodes total
while they prove that a flat network would require on the order of $2^n$ neurons (one per configuration of the $n$ binary inputs, each input being either 0 or 1) just to be correct on binary inputs $\implies$ exponentially many nodes
Theorem: if the function to estimate is a computational tree of the input variables, then the number of samples required to learn it with a similarly-shaped (deep) network scales as $d$, instead of $\exp d$
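A numerical sketch of the 4-neuron multiplication mentioned above (softplus is used here because its second derivative at 0 is non-zero; the exact construction in the cited work may differ):
```python
# Sketch of the "multiplication with 4 neurons" trick, using softplus as a
# smooth activation (sigma''(0) = 1/4). For small eps,
# sigma(a) + sigma(-a) ~ const + 0.5 * sigma''(0) * a^2, so the difference of
# the two pairs below is proportional to (x+y)^2 - (x-y)^2 = 4*x*y.
import numpy as np

def softplus(u):
    return np.log1p(np.exp(u))

def approx_mul(x, y, eps=1e-2):
    s = (softplus(eps * (x + y)) + softplus(-eps * (x + y))
         - softplus(eps * (x - y)) - softplus(-eps * (x - y)))
    return s / (4 * 0.25 * eps**2)   # divide by 4 * sigma''(0) * eps^2

print(approx_mul(3.0, -2.5))         # ~ -7.5
```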
Examples of successes and failures of deep learning vs. classical techniques (random forests)
successes in computer vision: due to handling image properties well: explicit geometry handling, invariance to translation, hierarchical models (see Chapter 2)
a random forest (for instance), lacking geometry information (which is difficult to incorporate), would require massively more data to learn anything from images: without geometry knowledge there is no generalization power from one variable (one pixel color) to the next, so what is learned for one pixel has to be re-learned for every other pixel
conversely: if the data has no particular structure, a random forest can do the job; and with little data, a neural network won't be able to
success in RL:
as a function of processing power: is running AlphaGo / AlphaGo Zero on a huge GPU farm for hours, after having played zillions of games, a success?
44 million training games (not achievable during one human life; it sounds more like Humanity's total), 4 hours on 5000 TPUs = more than 2 years on a single TPU = thousands of years on a single CPU?
same with state-of-the-art huge NLP networks (BERT, XLNet, etc.): training them may cost hundreds of thousands of dollars
what about environmental footprint?
Gap between classical Machine Learning and Deep Learning
Forgotten Machine Learning basics (Minimum Description Length principle, regularizers, objective function different from evaluation criterion) and incidental palliatives (drop-out)
Reminder: ML setup
Given samples $(x_i, y_i)$,
estimate $f: x_i \mapsto \, \approx y_i$,
quantify the goodness of the output with a "loss function" $\mathrm{Loss}$,
and minimize $\sum_i \mathrm{Loss}(f(x_i), y_i) + \mathrm{Regularizer}(f)$,
where the regularizer makes sure that $f$ is regular enough (typically: smooth), hoping for generalization properties
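As a concrete instance of this setup (a minimal sketch, assuming NumPy; ridge regression is just one possible choice of $f$, loss and regularizer):
```python
# Minimal sketch of the regularized ML objective with a linear model f(x) = X w:
#   minimize  sum_i Loss(f(x_i), y_i) + lambda * ||w||^2   (ridge regression)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 1.0
# closed-form minimizer of ||X w - y||^2 + lam * ||w||^2
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
objective = np.sum((X @ w_hat - y) ** 2) + lam * np.sum(w_hat ** 2)
print(w_hat, objective)
```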
Some ML / optimization basics seem not to hold anymore (no overfit with millions of parameters?)
millions of parameters : what about Occam's razor (shortest or simplest explanation is the best) / Minimum Description Length ?
possible to train networks with millions of parameters without overfitting (still generalizes well, i.e. small gap between train and test error)
and this with fewer examples than parameters : estimator convergence?
highly non-convex optimization in a high-dimensional space : thought to be very hard and to be avoided
adding noise to the optimization process helps (!)
train to optimize a criterion (cross-entropy), but evaluate on another one (accuracy)... (because the evaluation criterion is not differentiable or doesn't have good gradient properties)
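A minimal sketch (assuming PyTorch) of this mismatch: cross-entropy is what gets differentiated and optimized, while accuracy, which has no useful gradient, is what gets reported:
```python
# Sketch: the training criterion (cross-entropy, differentiable) differs from
# the evaluation criterion (accuracy, piecewise constant, no useful gradient).
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, requires_grad=True)   # predictions: 8 samples, 10 classes
labels = torch.randint(0, 10, (8,))

loss = F.cross_entropy(logits, labels)            # what we optimize
loss.backward()                                   # gradients exist

accuracy = (logits.argmax(dim=1) == labels).float().mean()  # what we report
print(loss.item(), accuracy.item())
```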
common recommendation: when trying to solve a new task: check first that the model is able to overfit (!) (in contrast with picking a regularized search space)
neural networks can also completely overfit (e.g. learn random labels): shown theoretically & experimentally (see the sketch below)
theory: minimal network size for overfitting is very small (given the sample, to fit random labels):
Theorem: There exists a two-layer neural network with ReLU activations and $2n + d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.
i.e. whatever the labels to predict are, it is possible to tune this small network to output those labels on those $n$ samples
$\implies$ capacity is not the issue (Rademacher or Vapnik-Chervonenkis), rather control / regularization / telling which functions are preferable
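An empirical sketch of the random-label experiment mentioned above (assuming PyTorch; this is not the explicit construction of the theorem, and the sizes and hyperparameters are arbitrary choices):
```python
# Sketch: a small two-layer ReLU network memorizes random labels on a small
# sample (empirical counterpart of the 2n + d theorem, not its exact construction).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 200, 20
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,))                     # random binary labels

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3000):
    opt.zero_grad()
    F.cross_entropy(model(X), y).backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean()
print(train_acc.item())                           # typically reaches 1.0: pure memorization
```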
how to know when overfitting or not? / how to prevent overfitting? [to be discussed later]
possible regularizers: weight decay, dropout, data augmentation → from experiments, their exact role is not clear (they do help, but are neither sufficient nor necessary)
NP-hard to compute the functional norm, because it is already NP-hard to decide whether a neural net computes the constant function 0... (example: with binary inputs/activities, determining whether the output is always 0 is a SAT problem)
but stochastic approximation of the norm = tractable
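A minimal sketch of such a stochastic approximation (assuming PyTorch; the Gaussian input distribution is an arbitrary choice for illustration): estimate a norm of $f$ from forward passes on random inputs, instead of deciding exactly whether $f \equiv 0$:
```python
# Sketch: deciding exactly whether a network computes the zero function is
# NP-hard, but a Monte Carlo estimate of its "size" (here E_x[f(x)^2] over a
# chosen input distribution) only needs forward passes on random samples.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

with torch.no_grad():
    x = torch.randn(10000, 10)        # samples from the chosen input distribution
    norm_estimate = (net(x) ** 2).mean().sqrt()
print(norm_estimate.item())           # stochastic approximation of ||f||
```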
Drop-out
At each training step, randomly select half of the neurons and temporarily drop them (replace their activation with 0); see the sketch below
robustness to noise in neuron activity $a_i$ $\;\;\Longleftrightarrow\;\;$ requiring $\frac{df}{da_i} \simeq 0$
$\implies$ not far from an explicit regularizer $\|\nabla_x f\|$
each drop-out realization leads to a different (sub-)network (since half of the nodes are missing)
optimize the expectation (of the loss) over possible drop-out realizations $\simeq$ ensemble method averaging the output of all possible sub-networks
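A minimal sketch of (inverted) dropout as described above, assuming PyTorch tensors (in practice one would use the built-in `nn.Dropout`):
```python
# Sketch of (inverted) dropout at training time: each neuron's activation is
# zeroed independently with probability p = 0.5, and the survivors are rescaled
# so that the expected activation is unchanged. Each random mask defines a
# different sub-network; training optimizes the expected loss over masks.
import torch

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                       # at test time: use the full network
    mask = (torch.rand_like(activations) > p).float()
    return activations * mask / (1.0 - p)        # rescale to preserve the expectation

a = torch.randn(4, 8)                            # activations of one layer (batch of 4)
print(dropout(a))                                # ~half of the entries are zeroed
```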
Early stopping
i.e. when training, store a copy $M(t)$ of the neural network $M$ at each time step $t$, and in the end pick the one $M(t')$ that has the lowest validation error.
→ standard Machine Learning process to select the best model among a list of models
the network has no time to specialize too much (to drift too far from the generic initialization) into overfitting minima
the same argument applies to weight decay (a sketch of early stopping follows below)
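A minimal sketch of early stopping on a toy task (assuming PyTorch; the model, data and hyperparameters are arbitrary choices for illustration):
```python
# Sketch of early stopping: store a copy of the model whenever the validation
# error improves, and keep that copy M(t') at the end instead of the final iterate.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(300, 10)
y = (X[:, 0] > 0).long()
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

best_val, best_model = float("inf"), None
for t in range(500):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()
    with torch.no_grad():
        val_err = loss_fn(model(X_val), y_val).item()
    if val_err < best_val:                       # keep the best checkpoint M(t')
        best_val, best_model = val_err, copy.deepcopy(model)
# use best_model, not model, as the final answer
```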
optimization $\neq$ usual gradient descent till convergence, but rather just an optimization *process* (a pile of recipes, selected manually by trial and error for their positive effect on neural network training)
Around convergence: if the Hessian $\left(\frac{d^2f}{dw^2}\right)$ is positive definite ("def $>0$"), then SGD noise does not harm:
def $>0$ $\implies$ convex problem: any noisy step away from the solution will be corrected later as it increases the loss
but if the Hessian is degenerate (positive semi-definite, with several 0 eigenvalues, i.e. a flat landscape in some directions, i.e. a continuum of global minima):
in a batch approach, i.e. if minimizing $\sum_{\text{all samples}\; i} \mathrm{Loss}(f(x_i))$: noisy steps going away from an initial global solution will be corrected only if they increase the training error
noisy steps increasing the test error but not the training error won't be corrected
$\implies$ training error remains constant but test error increases with time
$\implies$ SGD overfits after a while
Consequently:
make the Hessian become def $>0$ : add a weight regularizer ($\lambda\|w\|_2^2$, i.e. weight decay; see the computation below)
contrary to: weight normalization (of the connection matrix between two layers) + ReLU $\;\;\implies\;\;$ regularizes, but not def $>0$
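A one-line computation (sketch) behind the weight-decay case above: adding the squared norm shifts every eigenvalue of the Hessian by $2\lambda$, so a degenerate (semi-definite) Hessian becomes definite:
$$\nabla_w^2\left[\,\mathrm{Loss}(w) + \lambda \|w\|_2^2\,\right] \;=\; H + 2\lambda I \;\succ\; 0 \quad\text{whenever}\quad H \succeq 0 \;\text{ and }\; \lambda > 0$$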
To understand this, need some more detailed optimization landscape analysis
Optimization landscape / properties [small break]
Local minima and saddle points
swapping neurons in a layer (i.e. changing the ordering) does not change the function computed (permutation invariance) $\implies$ lots of minima... (any local minimum has factorial($N$) copies, i.e. the number of possible permutations of neurons, or, actually, $\prod_l N_l!$ where $N_l$ is the number of neurons in layer $l$)
so many parameters $\implies$ local minima are very good (because of the huge neighborhood: the derivative is 0 in many, many directions) (different from classical optimization)
Spot a phenomenon known as "double descent" in the case of training without regularization and without early stopping: for sufficiently-parameterized models, overfit increases with the number of parameters, but then decreases again!
Prove theoretically and experimentally that with a high-enough number $n$ of neurons, there is no optimization issue anymore, in the sense that there is convergence in $O\left(\frac{1}{\sqrt{n}}\right)$ to the average function over all possible initializations, which shows better generalization (as ensemble methods in general). NB: this does not say anything about the absolute generalization error of this average function.
$\implies$ when designing an architecture and choosing the number of neurons, do not worry that too many neurons might cause overfit or optimization issues: on the opposite, overparameterization helps and reduces such issues (if sufficiently overparameterized). It might slow down computations, but it will not degrade the quality of the result.
Same phenomenon, studied as a function of the number of training epochs: when training a large architecture (without regularization), overfit is followed by standard test error decrease again
and studied also as a function of the number of samples (dataset size): for some fixed architectures and a fixed number of training epochs, increasing the dataset size might lead to bigger overfit!
Link between deep neural networks and Hamiltonian of spherical spin-glass model, enabling theoretical statistics over critical points (saddle points, mimima, etc.)
$\implies$ most local minima are within a certain band (of loss value) ; the number of local minima outside that band diminishes exponentially with the size of the network.
$\implies$ major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered.
recovering the global minimum becomes harder as the network size increases, but this is in practice irrelevant, as the global minimum often leads to overfitting.
Quoting [Stanford class on CNN for Visual Recognition] : good summary about neural net complexity and performance : Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.
$\implies$ don't evaluate the complexity of the network (or its optimization) by its number of neurons (not a good proxy) : larger networks are easier to train (more local minima, but good ones, to the opposite of small architectures)
then, how?
ML fundamentals (MDL) are still there!
High redundancy inside neural networks
MDL principle is lost? number of parameters = huge, but:
little precision is needed (in the encoding of the parameters)
high redundancy (different parts of the network may compute similar quantities)
A good read / source of references:
"Deep Learning" book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks
Reminder: Minimum Description Length (MDL)
Origin of the ML setup (theoretical justification, from information theory)
Occam's razor: shorter explanation = the best one, will generalize better to new examples
$\implies$ find the shortest program that produces $(y_i)$, being given $(x_i)$
Solomonoff's prior: proba$(f) = 2^{\,- \,\mathrm{Kolmogorov\; complexity}(f) }$ : defines a prior probability on functions (or programs) $f$ that reflects their complexity ($\implies$ prefer simple functions)
usual approximation of Kolmogorov complexity :
Akaike information criterion (AIC): number $n$ of parameters
or Bayesian information criterion (BIC): $\; n \log N$, with $N$ = number of samples
in the Bayesian probabilistic setup: choose a prior over functions, e.g. prior$(f) \propto e^{- \|f\|}$ for some norm, and measure the likelihood of $f$ to produce the right answer: $p(y | f(x))$, then MDL rewrites as the ML setup:
$$p(f) = \prod_{\mathrm{samples}\;i} p(y_i|f(x_i)) \times \mathrm{prior}(f) \;\;\text{to be maximized}$$
$$\implies\;\;\;\;\; E(f) = -\log p(f) = \sum_i -\log p(y_i|f(x_i)) \;- \log \mathrm{prior}(f)\;\; \text{to be minimized}$$
$$\implies\;\;\;\;\; E(f) = \sum_i \mathrm{Loss}(f(x_i), y_i) \; + \|f\|\;\; \text{to be minimized}$$
Back to the original formulation of MDL
Actually, AIC and BIC are just approximations of MDL, not valid here
Let's go back to the original MDL formulation, and derive another, simple expression to estimate neural network complexity, which does not rely on parameter number
NB: we consider networks with probabilistic outputs, i.e. one does not compute a single value $y_i$ given an input $x_i$, but a distribution $p_\mathrm{predicted}(y_i)$, denoted $p_\mathrm{predicted}(y_i | x_i)$
main idea: do not encode the network already trained (millions of weights to encode!), but encode the process of training that network (cheap since input data $(x_i)$ are given), so that the training can be re-performed exactly from that information, and thus the trained network can be recomputed
start with always the same neural network initialization (chosen so that it is short to describe, e.g. the random seed)
at each training time step $t$ : get new samples $(x_i)$, predict for each of them an associated probability distribution $p_t(y_i|x_i)$, and compute how much it costs to describe the true $y_i$ according to that predicted distribution (a better prediction means a smaller cost): $-\log p_t(\mathrm{true\; label}\; y_i \,|\, x_i)$
→ NB: this quantity derives from information theory; if not familiar with it, here's a course about information theory oriented towards machine learning
total description length of this training algorithm: negligible constant $+ \sum_t \sum_{i \in \mathrm{batch\; at\; time}\; t} - \log p_t(y_i | x_i)$ where $p_t$ is computed by the network at training step $t$ (before it has seen the $y_i$ of its new batch); see the sketch below.
note that this quantity is related to the performance of the network, its trainability, but not directly to the number of its parameters
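A minimal sketch of this prequential description length (assuming PyTorch; the synthetic data stream and architecture are arbitrary choices for illustration):
```python
# Sketch: before training on each new batch, the code length of its true labels
# under the current model's predictions is -log p_t(y_i | x_i); summing these
# over training gives the description length (in nats, since log is natural).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

description_length_nats = 0.0
for t in range(100):                              # one new batch per step
    x = torch.randn(32, 20)
    y = (x[:, 0] > 0).long()
    logits = model(x)
    # code length of the true labels under p_t (before updating on this batch)
    with torch.no_grad():
        description_length_nats += F.cross_entropy(logits, y, reduction="sum").item()
    # now train on the batch as usual
    opt.zero_grad()
    F.cross_entropy(logits, y).backward()
    opt.step()

print(description_length_nats)                    # + a negligible constant (seed, code)
```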
MDL is still compatible with deep learning!
train many models (different architectures, etc.) and pick the one with the lowest description length: it will generalize better, according to the MDL principle / Solomonoff prior
one can separate, in the description cost, the part due to the parameters (weights) [information stored in the network] from the part due to uncertainty in the output (formula below):
compute the description length as if one had known the trained network from the beginning: $\sum_{\mathrm{all\;samples} \;i\; \mathrm{from\;all\;batches}} -\log p_\mathrm{trained}(y_i|x_i)$
compute the difference with the real total description length above : the difference expresses how much information is stored in the weights of the trained network, i.e. how much the network learned
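In formulas (restating the two quantities above, with $p_t$ the network at training step $t$ and $p_\mathrm{trained}$ the trained network):
$$\mathrm{information\ stored\ in\ the\ weights} \;\approx\; \sum_t \sum_{i \in \mathrm{batch}(t)} -\log p_t(y_i|x_i) \;\;-\; \sum_{\mathrm{all\ samples}\ i} -\log p_\mathrm{trained}(y_i|x_i)$$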