
# Chapter 1: Deep learning vs. classical machine learning and optimization

NB: turn on JavaScript to get nicely rendered mathematical formulas (MathJax)
NB2: an old raw text file is also available if you prefer a more compact summary

Overview:

I - Going Deep or not?
II - Gap between classical Machine Learning and Deep Learning
III - Palliatives for regularization
IV - Optimization landscape / properties [small break]
V - ML fundamentals (MDL) are still there!
VI - Architectures as priors on function space, initializations as random nonlinear projections

Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks

## I - Going Deep or not?

### No guarantee

There is no guarantee (beforehand) that training (with a given architecture and criterion to optimize) will lead to a good solution

### Uselessness of universal approximation theorems (expressive power)

• Main universal approximation theorems (e.g. Cybenko 1989, Hornik 1991):
• These theorems are about the possibility of learning a function "by heart", given enough samples (i.e. a dense sampling of all possible inputs), but say nothing about possible generalization to new samples (learnability of a function?)
• Example: one would need an infinite number of samples to learn a periodic function on $\mathbb{R}$ (no generalization ability)

### Depth simplifies the approximation/estimation task:

Why go deep: examples of trade-offs between depth and layer width, with explicit bounds (for instance, some functions that a deep network represents with polynomially many units require exponentially many units once the depth is reduced)

### Does it work? When?

Examples of successes and failures of deep learning vs. classical techniques (random forests)
• successes in computer vision: due to handling image properties well: explicit geometry handling, invariance to translation, hierarchical models (see Chapter 2)
• a random forest (for example), without geometry information (which is difficult to incorporate), would require massively more data to learn anything from images: without geometric knowledge there is no generalization from one variable (one pixel color) to the next, so what is learned for one pixel has to be re-learned for every other pixel
• conversely: if the data has no particular structure, a random forest can do the job; and with little data, a neural network won't be able to
• success in RL:
as a function of processing power: running AlphaGo / AlphaGo Zero on a huge GPU farm for hours, after having played zillions of games: is it a success?
• 44 million training games (not achievable within one human lifetime; it sounds more like humanity's total), 4 hours on 5000 TPUs = more than 2 years on a single TPU = thousands of years on a single CPU?
• sounds like very slow learning; wouldn't other methods perform as well with such computational power? [Benjamin Recht's presentation at DALI, slide 25]
• same for the huge state-of-the-art NLP networks (BERT, XLNet, etc.): they may cost hundreds of thousands of dollars to train

## II - Gap between classical Machine Learning and Deep Learning

Forgotten Machine Learning basics (Minimum Description Length principle, regularizers, objective function different from evaluation criterion) and incidental palliatives (drop-out)

### Reminder: ML setup

• Given samples $(x_i, y_i)$,
• estimate $f$ such that $f(x_i) \approx y_i$,
• quantify goodness of output with a "loss function" $\mathrm{Loss}$,
• search for $$\inf_{f \;\in\; \mathrm{predefined\; parameterized\; family}} \sum_{\mathrm{examples}\;i} \mathrm{Loss}(f(x_i), y_i) + \mathrm{Regularizer}(f)$$
• where the regularizer makes sure that $f$ is regular enough (typically: smooth), hoping for generalization properties
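
As a concrete (toy) illustration of this setup, here is a minimal sketch, assuming PyTorch, synthetic data, a linear model as the "predefined parameterized family", a squared-error loss and an $\ell_2$ regularizer; all the specific choices (dimensions, learning rate, $\lambda$) are arbitrary:

```python
import torch

# Synthetic toy data: 100 samples of dimension 10, scalar targets.
xs, ys = torch.randn(100, 10), torch.randn(100, 1)

# "Predefined parameterized family": here, linear models f(x) = Wx + b.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
lam = 1e-3  # regularization strength (arbitrary)

for step in range(1000):
    optimizer.zero_grad()
    data_term = ((model(xs) - ys) ** 2).sum()                         # sum_i Loss(f(x_i), y_i)
    reg_term = lam * sum((p ** 2).sum() for p in model.parameters())  # Regularizer(f)
    (data_term + reg_term).backward()
    optimizer.step()
```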

### Some ML / optimization basics no longer seem to really hold (no overfitting with millions of parameters?)

• millions of parameters: what about Occam's razor (the shortest or simplest explanation is the best) / the Minimum Description Length principle?
• it is possible to train networks with millions of parameters without overfitting (they still generalize well, i.e. the gap between train and test error stays small)
• and this with fewer examples than parameters: what about estimator convergence?
• highly non-convex optimization in a high-dimensional space: this was thought to be very hard and to be avoided
• adding noise to the optimization process helps (!)
• training optimizes one criterion (cross-entropy), but evaluation uses another one (accuracy)... because the evaluation criterion is not differentiable or does not have useful gradients (see the sketch after this list)
• common recommendation when trying to solve a new task: first check that the model is able to overfit (!) (in contrast with picking a regularized search space)
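
To make the cross-entropy-vs-accuracy point concrete, here is a minimal sketch (assuming PyTorch; `logits` and `labels` stand in for a hypothetical batch of model outputs and true classes):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10, requires_grad=True)  # model outputs: batch of 32, 10 classes
labels = torch.randint(0, 10, (32,))              # true class indices

# Training criterion: cross-entropy (differentiable, with informative gradients).
loss = F.cross_entropy(logits, labels)
loss.backward()

# Evaluation criterion: accuracy (piecewise constant, zero gradient almost everywhere).
accuracy = (logits.argmax(dim=1) == labels).float().mean()
```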

### A closer look at overfitting

[Understanding deep learning requires rethinking generalization, Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, ICLR 2017] & presentation by Benjamin Recht at DALI
• huge models can do the job (might not overfit):

→ source: Benjamin Recht's slides 9 & 11, respectively
• they can also completely overfit (learn random labels): shown both theoretically & experimentally (a minimal experiment in this spirit is sketched after this list)
• theory: the minimal network size needed to overfit (i.e. to fit random labels on the given sample) is very small:
Theorem: There exists a two-layer neural network with ReLU activations and $2n + d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.
i.e. whatever the labels to predict are, it is possible to tune this small network to output those labels on those $n$ samples
• $\implies$ capacity (Rademacher complexity, Vapnik-Chervonenkis dimension) is not the issue; the issue is rather control / regularization / specifying which functions are preferable
• how to know when overfitting or not? / how to prevent overfitting? [to be discussed later]
• possible regularizers: weight decay, dropout, data augmentation → experimentally, the picture is not clear (they are neither sufficient nor necessary, even though they do help)
• early stopping = regularizer
• SGD, as an optimization process, also acts as a regularizer
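
A minimal experiment in the spirit of Zhang et al., assuming PyTorch (sizes, width and number of steps are arbitrary choices): a two-layer ReLU network trained on purely random labels; its training accuracy can typically be driven to 1.0, which illustrates that capacity measures alone cannot explain generalization.

```python
import torch
import torch.nn.functional as F

n, d, k = 512, 32, 10
xs = torch.randn(n, d)
ys = torch.randint(0, k, (n,))   # purely random labels: nothing to generalize from

# Two-layer ReLU network, wide enough to fit the sample.
model = torch.nn.Sequential(
    torch.nn.Linear(d, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    F.cross_entropy(model(xs), ys).backward()
    opt.step()

train_acc = (model(xs).argmax(dim=1) == ys).float().mean()
print(train_acc.item())  # typically reaches 1.0: the random labels have been memorized
```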

## III - Palliatives for regularization

[Function Norms and Regularization in Deep Networks, Amal Rannen Triki, Maxim Berman, Matthew Blaschko]
• it is NP-hard to compute the functional norm of a network, because it is NP-hard even to decide whether a neural net computes the zero function... (example: with binary inputs/activities, determining whether the output is always 0 is a SAT problem)
• but a stochastic approximation of the norm is tractable (see the sketch below)
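
A minimal sketch of such a stochastic approximation (an illustration, not the exact estimator of the paper), assuming PyTorch and an input distribution one can sample from: the squared $L_2$ function norm $\mathbb{E}_x\left[\|f(x)\|^2\right]$ is estimated by Monte Carlo over a batch of inputs.

```python
import torch

def stochastic_l2_norm_sq(model, sample_inputs, n_samples=4096):
    """Monte Carlo estimate of E_x[ ||f(x)||^2 ] over the input distribution."""
    xs = sample_inputs(n_samples)      # draw inputs from the data distribution
    with torch.no_grad():
        return (model(xs) ** 2).sum(dim=1).mean()

# Usage with a toy model and a standard-normal input distribution (an assumption):
model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
norm_sq = stochastic_l2_norm_sq(model, lambda n: torch.randn(n, 10))
```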

### Drop-out

At each training step, randomly select a fraction of the neurons (typically half) and temporarily drop them (set their activations to 0)
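
A minimal sketch of this mechanism (illustrative only; in practice one would use `torch.nn.Dropout`), using the common "inverted dropout" convention where surviving activations are rescaled so that nothing needs to change at test time:

```python
import torch

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit independently with probability p during
    training, and rescale the survivors by 1/(1-p) so the expectation is unchanged."""
    if not training:
        return activations                         # identity at test time
    mask = (torch.rand_like(activations) > p).float()
    return activations * mask / (1.0 - p)
```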

### Early stopping

i.e. when training, store a copy $M(t)$ of the neural network $M$ at each time step $t$, and in the end pick the one $M(t')$ that has the lowest validation error.
→ standard Machine Learning process to select the best model among a list of models
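
A minimal sketch of this selection procedure, assuming hypothetical `train_one_epoch` and `validation_error` helpers (any training loop and validation metric would do):

```python
import copy

def train_with_early_stopping(model, n_epochs, train_one_epoch, validation_error):
    """Train for n_epochs and return the stored copy M(t') with the lowest validation error."""
    best_model, best_err = copy.deepcopy(model), validation_error(model)
    for t in range(n_epochs):
        train_one_epoch(model)            # one optimization pass over the training set
        err = validation_error(model)
        if err < best_err:                # keep a copy of the current best model
            best_model, best_err = copy.deepcopy(model), err
    return best_model
```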

Other point of view:
Quoting [Deep Learning book, Ian Goodfellow, Yoshua Bengio, Aaron Courville]:
Bishop [Regularization and Complexity Control in Feed-forward Networks, Christopher Bishop, ICANN 1995] and Sjoberg and Ljung [Overtraining, Regularization, and Searching for Minimum in Neural Networks, J. Sjöberg, L. Ljung, International Journal of Control 1995] argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value $\theta_0$.
• the network has no time to specialize too much (i.e. to move too far from the generic initialization) into overfitting minima
• the same argument then applies to weight decay
• optimization $\neq$ the usual gradient descent until convergence, but rather just an optimization *process* (a pile of recipes, selected manually by trial and error, that have a positive effect on neural network training)

### Optimization noise acts as a regularizer

[Tomaso Poggio, various publications]
• Around convergence: if the Hessian of the loss with respect to the weights $\left(\frac{\partial^2 \mathrm{Loss}}{\partial w^2}\right)$ is positive definite ("def $>0$"), then SGD noise does no harm:
• def $>0$ $\implies$ locally convex problem: any noisy step away from the solution will be corrected later, as it increases the loss
• but if the Hessian is degenerate (positive semi-definite but not definite: several zero eigenvalues, i.e. a flat landscape in some directions, i.e. a continuum of global minima):
• in a batch approach, i.e. when minimizing $\sum_{\text{all samples}\; i} \mathrm{Loss}(f(x_i), y_i)$: noisy steps going away from an initial global solution will be corrected only if they increase the training error
• noisy steps increasing the test error but not the training error won't be corrected
• $\implies$ training error remains constant but test error increases with time
• $\implies$ SGD overfits after a while
• Consequently:
• to make the Hessian positive definite ("def $>0$"): add a weight regularizer ($\lambda\|w\|_2^2$) (see the one-line check after this list)
• in contrast: weight normalization (of the connection matrix between two layers) + ReLU $\;\;\implies\;\;$ regularizes, but the Hessian is not def $>0$
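
A one-line check of why an $\ell_2$ weight penalty makes the Hessian positive definite around such a degenerate minimum (assuming the squared norm $\lambda \|w\|_2^2$ is the regularizer): it adds $2\lambda I$ to the Hessian, shifting every eigenvalue upwards by $2\lambda$,
$$\nabla^2_w \left[ \sum_i \mathrm{Loss}(f_w(x_i), y_i) + \lambda \|w\|_2^2 \right] \;=\; H + 2\lambda I, \qquad \mu_k \ge 0 \;\implies\; \mu_k + 2\lambda > 0.$$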

### Overparameterization helps (hot topic)

To understand this, we need a more detailed analysis of the optimization landscape.

## IV - Optimization landscape / properties [small break]

### Back to: Overparameterization helps

[Scaling description of generalization with number of parameters in deep learning; Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, Matthieu Wyart, 2019]
• They spot a phenomenon known as "double descent" in the setting of training without regularization and without early stopping: overfitting disappears with enough neurons!
• They prove theoretically and experimentally that with a high enough number $n$ of neurons, there is no overfitting problem anymore (the overfit decreases as $\frac{1}{\sqrt{n}}$ for $n$ large enough)
• $\implies$ when designing an architecture and choosing the number of neurons, do not worry that too many neurons might cause overfitting or optimization issues: on the contrary, overparameterization helps and reduces such issues (if sufficiently overparameterized). It may slow down computations, but it does not degrade the quality of the result.

To go further:
[Deep Double Descent: Where Bigger Models and More Data Hurt; Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever, 2019]
• Same phenomenon, studied as a function of the number of training epochs: when training a large architecture (without regularization), the overfitting phase is followed by a renewed decrease of the test error
• also studied as a function of the number of samples (dataset size): for some fixed architectures and a fixed number of training epochs, increasing the dataset size can lead to more overfitting!

Quoting [Stanford class on CNN for Visual Recognition], a good summary about neural net complexity and performance:
Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.

• $\implies$ don't evaluate the complexity of a network (or of its optimization) by its number of neurons (it is not a good proxy): larger networks are easier to train (more local minima, but good ones, unlike small architectures)
• then, how should complexity be evaluated?

## V - ML fundamentals (MDL) are still there!

### High redundancy inside neural networks

Is the MDL principle lost? The number of parameters is huge, but neural networks are highly redundant, so the parameter count greatly overestimates their real description length.

For more details on how to adapt the minimum description length principle (MDL) to large networks, check the bonus part.

## VI - Architectures as priors on function space, initializations as random nonlinear projections

Cf Chapter 2 on architectures.

A good read / source of references: "Deep Learning" book by Ian Goodfellow, Yoshua Bengio and Aaron Courville

## Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks

### Reminder: Minimum Description Length (MDL)

Origin of the ML setup (theoretical justification, from information theory)
• Occam's razor: shorter explanation = the best one, will generalize better to new examples
• $\implies$ find the shortest program that produces $(y_i)$, being given $(x_i)$
• Solomonoff's prior: $p(f) = 2^{\,- \,\mathrm{Kolmogorov\; complexity}(f) }$: defines a prior probability on functions (or programs) $f$ reflecting their complexity ($\implies$ prefer simple functions)
• usual approximations of the Kolmogorov complexity (used as complexity penalties):
• Akaike information criterion (AIC): the number $n$ of parameters
• or Bayesian information criterion (BIC): $n \log N$, with $N$ = number of samples
• in the Bayesian probabilistic setup: choose a prior over functions, e.g. $\mathrm{prior}(f) \propto e^{- \|f\|}$ for some norm, and measure the likelihood of $f$ producing the right answers, $p(y | f(x))$; then MDL rewrites as the ML setup: $$p(f \,|\, \mathrm{data}) \propto \prod_{\mathrm{samples}\;i} p(y_i|f(x_i)) \times \mathrm{prior}(f) \;\;\text{to be maximized}$$ $$\implies\;\;\;\;\; E(f) = -\log p(f \,|\, \mathrm{data}) = \sum_i -\log p(y_i|f(x_i)) \;- \log \mathrm{prior}(f) \;+\; \mathrm{const.}\;\; \text{to be minimized}$$ $$\implies\;\;\;\;\; E(f) = \sum_i \mathrm{Loss}(f(x_i), y_i) \; + \|f\|\;\; \text{to be minimized}$$
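
As a standard special case (an illustration, not specific to this course): with a Gaussian observation model and a Gaussian prior on $f$, the generic formula above reduces to least squares with an $\ell_2$-type regularizer,
$$p(y_i|f(x_i)) \propto e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}, \quad \mathrm{prior}(f) \propto e^{-\lambda \|f\|^2} \;\;\implies\;\; E(f) = \frac{1}{2\sigma^2}\sum_i (y_i - f(x_i))^2 \;+\; \lambda \|f\|^2 \;+\; \mathrm{const.}$$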

### Back to the original formulation of MDL

Actually, AIC and BIC are just approximations of MDL, and are not valid here
• Let's go back to the original MDL formulation and derive another, simple expression to estimate neural network complexity, one which does not rely on the number of parameters
• NB: we consider networks with probabilistic outputs, i.e. one does not compute a single value $y_i$ given an input $x_i$, but a distribution $p_\mathrm{predicted}(y_i)$, denoted $p_\mathrm{predicted}(y_i | x_i)$
• We describe here [The Description Length of Deep Learning Models, Léonard Blier, Yann Ollivier, NIPS 2018]: compression, not of the network itself, but of $\left(\left.\mathrm{network}, \mathrm{labels}\; (y_i) \; \right|\; \mathrm{inputs}\; (x_i)\right)$.
• main idea: do not encode the network already trained (millions of weights to encode!), but encode the process of training that network (cheap since input data $(x_i)$ are given), so that the training can be re-performed exactly from that information, and thus the trained network can be recomputed
• always start with the same neural network initialization (chosen so that it is short to describe, e.g. via a fixed random seed)
• at each training time step $t$: get new samples $(x_i)$, predict for each of them an associated probability distribution $p_t(y_i|x_i)$, and compute how much it costs to describe the true $y_i$ according to that predicted distribution (a better prediction means a smaller cost): $-\log p_t(\mathrm{true\; label}\; y_i | x_i)$
→ NB: this quantity comes from information theory; if you are not familiar with it, here is a course about information theory oriented towards machine learning
• total description length of this training algorithm: a negligible constant $+ \sum_t \sum_{i \in \mathrm{batch\; at\; time}\; t} - \log p_t(y_i | x_i)$, where $p_t$ is computed by the network at training step $t$ (before it has seen the $y_i$ of its new batch); see the sketch after this list
• note that this quantity is related to the performance of the network and to its trainability, but not directly to its number of parameters
• MDL is still compatible with deep learning!
• train many models (different architectures, etc.) and pick the one with the lowest description length: it will generalize better, according to the MDL principle / Solomonoff prior
• one can separate, within the description cost, the part due to the parameters (weights) [the information stored in the network] from the part due to uncertainty in the output:
• compute the description length as if one had known the trained network from the beginning: $\sum_{\mathrm{all\;samples} \;i\; \mathrm{from\;all\;batches}} -\log p_\mathrm{trained}(y_i|x_i)$
• compute the difference with the real total description length above: this difference expresses how much information is stored in the weights of the trained network, i.e. how much the network has learned
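
A minimal sketch of this prequential computation (an illustration, assuming PyTorch, a classifier `model`, an `optimizer`, and an iterator of `(x, y)` batches; costs are counted in nats via the cross-entropy):

```python
import torch
import torch.nn.functional as F

def prequential_description_length(model, optimizer, batches):
    """Sum of -log p_t(y_i | x_i) (in nats) over training, where p_t is the model
    *before* it has seen the labels of batch t; the negligible constant cost of
    describing the architecture and the random seed is omitted."""
    total_nats = 0.0
    for x, y in batches:
        with torch.no_grad():
            # encoding cost of the true labels under the current predictive distribution
            total_nats += F.cross_entropy(model(x), y, reduction="sum").item()
        # only then use the batch for a training step
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
    return total_nats
```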

Back to the main page of the course