Chapter 1: Deep learning vs. classical machine learning and optimization
No guarantee (beforehand) that the training (with such architecture, criterion to optimize) will lead to a good solution
Uselessness of universal approximation theorems (expressive power)
Main universal approximation theorems:
[George Cybenko, 1989; Kurt Hornik, 1991; Wikipedia summary]: one hidden layer is sufficient to approximate any continuous function (defined on a compact set such as $[0,1]^d$) arbitrarily precisely, provided it has enough neurons (for any continuous, non-constant activation function)
+ an exact representation theorem by Kolmogorov [1956] (exact decomposition of any continuous function with a finite number of neurons with specific activation functions): but the construction relies on extremely irregular, hashing-like inner functions
These theorems are about the possibility of learning a function "by heart", given enough samples (i.e. a dense sampling of all possible inputs), but they say nothing about generalization to new samples (learnability of a function?)
Example: one would need an infinite number of samples to learn a periodic function on $\mathbb{R}$ (no generalization ability); see the sketch below
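A minimal sketch (assuming PyTorch; the architecture and hyperparameters are arbitrary choices for illustration) of both points: one hidden layer is enough to fit $\sin(2\pi x)$ on $[0,1]$ from a dense sampling, yet the fit says nothing about the behaviour outside the sampled region:
```python
# Hypothetical illustration (PyTorch assumed): a one-hidden-layer network fits
# sin(2*pi*x) well on the training interval [0, 1], but fails to extrapolate
# to [1, 2]: approximation power says nothing about generalization.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x_train = torch.rand(2000, 1)                      # dense sampling of [0, 1]
y_train = torch.sin(2 * math.pi * x_train)

model = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = ((model(x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()

x_out = torch.rand(2000, 1) + 1.0                  # points in [1, 2], never seen
with torch.no_grad():
    err_in  = ((model(x_train) - y_train) ** 2).mean()
    err_out = ((model(x_out) - torch.sin(2 * math.pi * x_out)) ** 2).mean()
print(err_in.item(), err_out.item())               # typically: small vs. large
```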
Depth simplifies the approximation/estimation task:
Why deep: examples of depth vs. layer size compromises with explicit bounds
technical preamble: a multiplication can be approximated with 4 neurons (computing weighted sums followed by the activation), provided the activation function is smooth (admits a Taylor expansion); see the sketch after this list
then: multiplication of $n$ variables is possible with a binary tree of depth $\log n$, with $4 n$ nodes total
while they prove that a flat network would require on the order of $2^n$ neurons (one per configuration of the $n$ binary inputs, each input being either 0 or 1) just to be correct on binary inputs $\implies$ exponentially many nodes
Theorem: if the function to estimate is a computational tree of the input variables, then the number of samples required to learn it with a similarly-shaped (deep) network scales as $d$, instead of $\exp d$
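A numerical sketch of the 4-neuron multiplication mentioned above (softplus is used here because its second derivative at 0 is non-zero; the exact construction in the cited work may differ):
```python
# Sketch of the "multiplication with 4 neurons" trick, using softplus as a
# smooth activation (sigma''(0) = 1/4). For small eps,
# sigma(a) + sigma(-a) ~ const + 0.5 * sigma''(0) * a^2, so the difference of
# the two pairs below is proportional to (x+y)^2 - (x-y)^2 = 4*x*y.
import numpy as np

def softplus(u):
    return np.log1p(np.exp(u))

def approx_mul(x, y, eps=1e-2):
    s = (softplus(eps * (x + y)) + softplus(-eps * (x + y))
         - softplus(eps * (x - y)) - softplus(-eps * (x - y)))
    return s / (4 * 0.25 * eps**2)   # divide by 4 * sigma''(0) * eps^2

print(approx_mul(3.0, -2.5))         # ~ -7.5
```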
Examples of successes and failures of deep learning vs. classical techniques (random forests)
successes in computer vision: due to handling image properties well: explicit geometry handling, invariance to translation, hierarchical models (see Chapter 2)
a random forest (for instance), lacking geometry information (which is difficult to incorporate), would require massively more data to learn anything from images: without geometry knowledge there is no generalization power from one variable (one pixel color) to the next, so what is learned for one pixel has to be re-learned for every other pixel
conversely: if the data has no particular structure, a random forest can do the job; and with little data, a neural network won't be able to
success in RL:
as a function of processing power: is running AlphaGo / AlphaGo Zero on a huge GPU farm for hours, after having played zillions of games, a success?
44 million training games (not achievable during one human life; it sounds more like Humanity's total), 4 hours on 5000 TPUs = more than 2 years on a single TPU = thousands of years on a single CPU?
same with state-of-the-art huge NLP networks (BERT, XLNet, etc.): training them may cost hundreds of thousands of dollars
what about environmental footprint?
Gap between classical Machine Learning and Deep Learning
Forgotten Machine Learning basics (Minimum Description Length principle, regularizers, objective function different from evaluation criterion) and incidental palliatives (drop-out)
Reminder: ML setup
Given samples $(x_i, y_i)$,
estimate $f: x_i \mapsto \, \approx y_i$,
quantify the goodness of the output with a "loss function" $\mathrm{Loss}$,
and minimize $\sum_i \mathrm{Loss}(f(x_i), y_i) + \mathrm{Regularizer}(f)$,
where the regularizer makes sure that $f$ is regular enough (typically: smooth), hoping for generalization properties
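As a concrete instance of this setup (a minimal sketch, assuming NumPy; ridge regression is just one possible choice of $f$, loss and regularizer):
```python
# Minimal sketch of the regularized ML objective with a linear model f(x) = X w:
#   minimize  sum_i Loss(f(x_i), y_i) + lambda * ||w||^2   (ridge regression)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 1.0
# closed-form minimizer of ||X w - y||^2 + lam * ||w||^2
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
objective = np.sum((X @ w_hat - y) ** 2) + lam * np.sum(w_hat ** 2)
print(w_hat, objective)
```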
Some ML / optimization basics seem not to hold anymore (no overfit with millions of parameters?)
millions of parameters : what about Occam's razor (shortest or simplest explanation is the best) / Minimum Description Length ?
possible to train networks with millions of parameters without overfitting (still generalizes well, i.e. small gap between train and test error)
and this with fewer examples than parameters : estimator convergence?
highly non-convex optimization in a high-dimensional space : thought to be very hard and to be avoided
adding noise to the optimization process helps (!)
train to optimize a criterion (cross-entropy), but evaluate on another one (accuracy)... (because the evaluation criterion is not differentiable or doesn't have good gradient properties)
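A minimal sketch (assuming PyTorch) of this mismatch: cross-entropy is what gets differentiated and optimized, while accuracy, which has no useful gradient, is what gets reported:
```python
# Sketch: the training criterion (cross-entropy, differentiable) differs from
# the evaluation criterion (accuracy, piecewise constant, no useful gradient).
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, requires_grad=True)   # predictions: 8 samples, 10 classes
labels = torch.randint(0, 10, (8,))

loss = F.cross_entropy(logits, labels)            # what we optimize
loss.backward()                                   # gradients exist

accuracy = (logits.argmax(dim=1) == labels).float().mean()  # what we report
print(loss.item(), accuracy.item())
```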
common recommendation: when trying to solve a new task: check first that the model is able to overfit (!) (in contrast with picking a regularized search space)
neural networks can also completely overfit (e.g. learn random labels): shown theoretically & experimentally (see the sketch below)
theory: minimal network size for overfitting is very small (given the sample, to fit random labels):
Theorem: There exists a two-layer neural network with ReLU activations and $2n + d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.
i.e. whatever the labels to predict are, it is possible to tune this small network to output those labels on those $n$ samples
$\implies$ capacity is not the issue (Rademacher or Vapnik-Chervonenkis), rather control / regularization / telling which functions are preferable
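An empirical sketch of the random-label experiment mentioned above (assuming PyTorch; this is not the explicit construction of the theorem, and the sizes and hyperparameters are arbitrary choices):
```python
# Sketch: a small two-layer ReLU network memorizes random labels on a small
# sample (empirical counterpart of the 2n + d theorem, not its exact construction).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 200, 20
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,))                     # random binary labels

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3000):
    opt.zero_grad()
    F.cross_entropy(model(X), y).backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean()
print(train_acc.item())                           # typically reaches 1.0: pure memorization
```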
how to know when overfitting or not? / how to prevent overfitting? [to be discussed later]
possible regularizers: weight decay, dropout, data augmentation → from experiments, their exact role is not clear (they do help, but are neither sufficient nor necessary)
NP-hard to compute the functional norm, because it is already NP-hard to decide whether a neural net computes the constant function 0... (example: with binary inputs/activities, determining whether the output is always 0 is a SAT problem)
but stochastic approximation of the norm = tractable
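A minimal sketch of such a stochastic approximation (assuming PyTorch; the Gaussian input distribution is an arbitrary choice for illustration): estimate a norm of $f$ from forward passes on random inputs, instead of deciding exactly whether $f \equiv 0$:
```python
# Sketch: deciding exactly whether a network computes the zero function is
# NP-hard, but a Monte Carlo estimate of its "size" (here E_x[f(x)^2] over a
# chosen input distribution) only needs forward passes on random samples.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

with torch.no_grad():
    x = torch.randn(10000, 10)        # samples from the chosen input distribution
    norm_estimate = (net(x) ** 2).mean().sqrt()
print(norm_estimate.item())           # stochastic approximation of ||f||
```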
Drop-out
At each training step, randomly select half of the neurons and temporarily drop them (replace their activation with 0); see the sketch below
robustness to noise in neuron activity $a_i$ $\;\;\Longleftrightarrow\;\;$ requiring $\frac{df}{da_i} \simeq 0$
$\implies$ not far from an explicit regularizer $\|\nabla_x f\|$
each drop-out realization leads to a different (sub-)network (since half of the nodes are missing)
optimize the expectation (of the loss) over possible drop-out realizations $\simeq$ ensemble method averaging the output of all possible sub-networks
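A minimal sketch of (inverted) dropout as described above, assuming PyTorch tensors (in practice one would use the built-in `nn.Dropout`):
```python
# Sketch of (inverted) dropout at training time: each neuron's activation is
# zeroed independently with probability p = 0.5, and the survivors are rescaled
# so that the expected activation is unchanged. Each random mask defines a
# different sub-network; training optimizes the expected loss over masks.
import torch

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                       # at test time: use the full network
    mask = (torch.rand_like(activations) > p).float()
    return activations * mask / (1.0 - p)        # rescale to preserve the expectation

a = torch.randn(4, 8)                            # activations of one layer (batch of 4)
print(dropout(a))                                # ~half of the entries are zeroed
```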
Early stopping
i.e. when training, store a copy $M(t)$ of the neural network $M$ at each time step $t$, and in the end pick the one $M(t')$ that has the lowest validation error.
→ standard Machine Learning process to select the best model among a list of models
the network has no time to specialize too much (to drift too far from the generic initialization) into overfitting minima
the same argument applies to weight decay (a sketch of early stopping follows below)
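A minimal sketch of early stopping on a toy task (assuming PyTorch; the model, data and hyperparameters are arbitrary choices for illustration):
```python
# Sketch of early stopping: store a copy of the model whenever the validation
# error improves, and keep that copy M(t') at the end instead of the final iterate.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(300, 10)
y = (X[:, 0] > 0).long()
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

best_val, best_model = float("inf"), None
for t in range(500):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()
    with torch.no_grad():
        val_err = loss_fn(model(X_val), y_val).item()
    if val_err < best_val:                       # keep the best checkpoint M(t')
        best_val, best_model = val_err, copy.deepcopy(model)
# use best_model, not model, as the final answer
```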
optimization $\neq$ usual gradient descent till convergence, but rather just an optimization *process* (a pile of recipes, selected manually by trial and error for their positive effect on neural network training)
Around convergence: if the Hessian $\left(\frac{d^2f}{dw^2}\right)$ is positive definite ("def $>0$"), then SGD noise does not harm:
def $>0$ $\implies$ convex problem: any noisy step away from the solution will be corrected later as it increases the loss
but if the Hessian is degenerate (positive semi-definite, with several 0 eigenvalues, i.e. a flat landscape in some directions, i.e. a continuum of global minima):
in a batch approach, i.e. if minimizing $\sum_{\text{all samples}\; i} \mathrm{Loss}(f(x_i))$: noisy steps going away from an initial global solution will be corrected only if they increase the training error
noisy steps increasing the test error but not the training error won't be corrected
$\implies$ training error remains constant but test error increases with time
$\implies$ SGD overfits after a while
Consequently:
make the Hessian become def $>0$ : add a weight regularizer ($\lambda\|w\|_2^2$, i.e. weight decay; see the computation below)
contrary to: weight normalization (of the connection matrix between two layers) + ReLU $\;\;\implies\;\;$ regularizes, but not def $>0$
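A one-line computation (sketch) behind the weight-decay case above: adding the squared norm shifts every eigenvalue of the Hessian by $2\lambda$, so a degenerate (semi-definite) Hessian becomes definite:
$$\nabla_w^2\left[\,\mathrm{Loss}(w) + \lambda \|w\|_2^2\,\right] \;=\; H + 2\lambda I \;\succ\; 0 \quad\text{whenever}\quad H \succeq 0 \;\text{ and }\; \lambda > 0$$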
To understand this, need some more detailed optimization landscape analysis
Optimization landscape / properties [small break]
Local minima and saddle points
swapping neurons in a layer (i.e. changing the ordering) does not change the function computed (permutation invariance) $\implies$ lots of minima... (any local minimum has factorial($N$) copies, i.e. the number of possible permutations of neurons, or, actually, $\prod_l N_l!$ where $N_l$ is the number of neurons in layer $l$)
so many parameters $\implies$ local minima are very good (because of the huge neighborhood: the derivative is 0 in many, many directions) (different from classical optimization)
Spot a phenomenon known as "double descent" in the case of training without regularization and without early stopping: for sufficiently-parameterized models, overfit increases with the number of parameters, but then decreases again!
Prove theoretically and experimentally that with a high-enough number $n$ of neurons, there is no optimization issue anymore, in the sense that there is convergence in $O\left(\frac{1}{\sqrt{n}}\right)$ to the average function over all possible initializations, which shows better generalization (as ensemble methods in general). NB: this does not say anything about the absolute generalization error of this average function.
$\implies$ when designing an architecture and choosing the number of neurons, do not worry that too many neurons might cause overfit or optimization issues: on the opposite, overparameterization helps and reduces such issues (if sufficiently overparameterized). It might slow down computations, but it will not degrade the quality of the result.
Same phenomenon, studied as a function of the number of training epochs: when training a large architecture (without regularization), overfit is followed by standard test error decrease again
and studied also as a function of the number of samples (dataset size): for some fixed architectures and a fixed number of training epochs, increasing the dataset size might lead to bigger overfit!
Link between deep neural networks and Hamiltonian of spherical spin-glass model, enabling theoretical statistics over critical points (saddle points, mimima, etc.)
$\implies$ most local minima are within a certain band (of loss value) ; the number of local minima outside that band diminishes exponentially with the size of the network.
$\implies$ major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered.
recovering the global minimum becomes harder as the network size increases, but this is in practice irrelevant, as the global minimum often leads to overfitting.
Quoting [Stanford class on CNN for Visual Recognition] : good summary about neural net complexity and performance : Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.
$\implies$ don't evaluate the complexity of the network (or its optimization) by its number of neurons (not a good proxy) : larger networks are easier to train (more local minima, but good ones, to the opposite of small architectures)
then, how?
ML fundamentals (MDL) are still there!
High redundancy inside neural networks
MDL principle is lost? number of parameters = huge, but:
little precision is needed (in the encoding of the parameters)
high redundancy (different parts of the network may compute similar quantities)
A good read / source of references:
"Deep Learning" book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks
Reminder: Minimum Description Length (MDL)
Origin of the ML setup (theoretical justification, from information theory)
Occam's razor: shorter explanation = the best one, will generalize better to new examples
$\implies$ find the shortest program that produces $(y_i)$, being given $(x_i)$
Solomonoff's prior: proba$(f) = 2^{\,- \,\mathrm{Kolmogorov\; complexity}(f) }$ : defines a prior probability on functions (or programs) $f$ that reflects their complexity ($\implies$ prefer simple functions)
usual approximation of Kolmogorov complexity :
Akaike information criterion (AIC): number $n$ of parameters
or Bayesian information criterion (BIC): $\; n \log N$, with $N$ = number of samples
in the Bayesian probabilistic setup: choose a prior over functions, e.g. prior$(f) \propto e^{- \|f\|}$ for some norm, and measure the likelihood of $f$ to produce the right answer: $p(y | f(x))$, then MDL rewrites as the ML setup:
$$p(f) = \prod_{\mathrm{samples}\;i} p(y_i|f(x_i)) \times \mathrm{prior}(f) \;\;\text{to be maximized}$$
$$\implies\;\;\;\;\; E(f) = -\log p(f) = \sum_i -\log p(y_i|f(x_i)) \;- \log \mathrm{prior}(f)\;\; \text{to be minimized}$$
$$\implies\;\;\;\;\; E(f) = \sum_i \mathrm{Loss}(f(x_i), y_i) \; + \|f\|\;\; \text{to be minimized}$$
Back to the original formulation of MDL
Actually, AIC and BIC are just approximations of MDL, not valid here
Let's go back to the original MDL formulation, and derive another, simple expression to estimate neural network complexity, which does not rely on parameter number
NB: we consider networks with probabilistic outputs, i.e. one does not compute a single value $y_i$ given an input $x_i$, but a distribution $p_\mathrm{predicted}(y_i)$, denoted $p_\mathrm{predicted}(y_i | x_i)$
main idea: do not encode the network already trained (millions of weights to encode!), but encode the process of training that network (cheap since input data $(x_i)$ are given), so that the training can be re-performed exactly from that information, and thus the trained network can be recomputed
start with always the same neural network initialization (chosen so that it is short to describe, e.g. the random seed)
at each training time step $t$ : get new samples $(x_i)$, predict for each of them an associated probability distribution $p_t(y_i|x_i)$, and compute how much it costs to describe the true $y_i$ according to that predicted distribution (a better prediction means a smaller cost): $-\log p_t(\mathrm{true\; label}\; y_i \,|\, x_i)$
→ NB: this quantity derives from information theory; if not familiar with it, here's a course about information theory oriented towards machine learning
total description length of this training algorithm: negligible constant $+ \sum_t \sum_{i \in \mathrm{batch\; at\; time}\; t} - \log p_t(y_i | x_i)$ where $p_t$ is computed by the network at training step $t$ (before it has seen the $y_i$ of its new batch); see the sketch below.
note that this quantity is related to the performance of the network, its trainability, but not directly to the number of its parameters
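A minimal sketch of this prequential description length (assuming PyTorch; the synthetic data stream and architecture are arbitrary choices for illustration):
```python
# Sketch: before training on each new batch, the code length of its true labels
# under the current model's predictions is -log p_t(y_i | x_i); summing these
# over training gives the description length (in nats, since log is natural).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

description_length_nats = 0.0
for t in range(100):                              # one new batch per step
    x = torch.randn(32, 20)
    y = (x[:, 0] > 0).long()
    logits = model(x)
    # code length of the true labels under p_t (before updating on this batch)
    with torch.no_grad():
        description_length_nats += F.cross_entropy(logits, y, reduction="sum").item()
    # now train on the batch as usual
    opt.zero_grad()
    F.cross_entropy(logits, y).backward()
    opt.step()

print(description_length_nats)                    # + a negligible constant (seed, code)
```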
MDL is still compatible with deep learning!
train many models (different architectures, etc.) and pick the one with the lowest description length: it will generalize better, according to the MDL principle / Solomonoff prior
one can separate, in the description cost, the part due to the parameters (weights) [information stored in the network] from the part due to uncertainty in the output (formula below):
compute the description length as if one had known the trained network from the beginning: $\sum_{\mathrm{all\;samples} \;i\; \mathrm{from\;all\;batches}} -\log p_\mathrm{trained}(y_i|x_i)$
compute the difference with the real total description length above : the difference expresses how much information is stored in the weights of the trained network, i.e. how much the network learned
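In formulas (restating the two quantities above, with $p_t$ the network at training step $t$ and $p_\mathrm{trained}$ the trained network):
$$\mathrm{information\ stored\ in\ the\ weights} \;\approx\; \sum_t \sum_{i \in \mathrm{batch}(t)} -\log p_t(y_i|x_i) \;\;-\; \sum_{\mathrm{all\ samples}\ i} -\log p_\mathrm{trained}(y_i|x_i)$$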