I - Going Deep or not?

- No guarantee
- Uselessness of universal approximation theorems (expressive power)
- Depth simplifies the approximation/estimation task
- Does it work? When?

- Reminder: ML setup
- Some ML / optimization basics seem to be not really true anymore (no overfit with millions of parameters?)
- A closer look at overfitting

- Drop-out
- What about adding functional regularizers?
- Early stopping
- Optimization noise acts as a regularizer
- Overparameterization helps (hot topic)

Additional reference

Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks

- Main universal approximation theorems:
- [George Cybenko, 1989; Kurt Hornik, 1991; Wikipedia summary]: one hidden layer is sufficient to approximate any continuous function (defined on a compact set such as $[0,1]^d$) arbitrarily precisely if many neurons enough (whatever continuous non-constant activation function is used)
- [On the structure of continuous functions of several variables, David A. Sprecher, 1965]
- + an exact representation theorem by Kolmogorov [1956] (
**exact**decomposition of any continuous function with a finite number of neurons with specific activation functions) : but with completely discontinuous proof (hashing)

- These theorems are about possibilities to learn "by heart" a function, given many samples enough (i.e. a dense sampling of all possible inputs), but do not tell anything about possible generalization properties to new samples (learnability of a function?)
- Example: would need an infinite amount of samples to learn a periodic function on $\R$ (no generalization ability)

- [Why does deep and cheap learning work so well? Henry W. Lin, Max Tegmark, David Rolnick, Journal of Statistics, 2017] : multiplication of $n$ input variables :
- technical preamble: a multiplication can be approximated with 4 neurons (performing sums), provided the activation function is smooth [developable]
- then: multiplication of $n$ variables is possible with a binary tree of depth $\log n$, with $4 n$ nodes total
- while they prove a flat network would require $\sum_{\mathrm{all}\; 2^n \;\mathrm{configurations}}$ just to be correct on binary numbers (each input is either 0 or 1) $\implies$ exponentially many nodes

- [Representation Benefits of Deep Feedforward Networks, Matus Telgarsky, 2015] : example: function = $2^m$ consecutive small triangles /\/\/\/\/\/\ = ( /\ ) $\circ^m$ (composed $m$ times with itself)
- representable with a thin deep network (depth $m$, 1 neuron per layer)
- while in a flat network: 2 layers require provably $2^{m/2}$ nodes
- and in a moderately flat network: $\sqrt m$ layers require $2^{\sqrt m}$ nodes, etc.

- [Learning Functions: When Is Deep Better Than Shallow, Hrushikesh Mhaskar, Qianli Liao, Tomaso Poggio, 2016] : architecture suitability for function estimation
- Theorem: if function to estimate = computational tree of input variables, then required number of samples to learn a similar-shaped network: $d$ instead of $\exp d$

- deep is sufficient (without width): [ResNet with one-neuron hidden layers is a Universal Approximator, Hongzhou Lin, Stefanie Jegelka, NIPS 2018]

- successes in computer vision: due to handling image properties well: explicit geometry handling, invariance to translation, hierarchical models (see Chapter 2)
- random forest (e.g.), without geometry information (difficult to incorporate), would require massively more data to learn anything from images (no generalization power [from a variable, representing a pixel color, to the next one] without geometry knowledge: what is learned for one pixel has to be re-learned for all other pixels)
- on the opposite: if no data structure : random forest can do the job; if small data: neural network won't be able to

- success in RL:

as a function of processing power: running Alpha-Go/0 on a huge GPU farm for hours, having played zillions of games: is it a success?- 44 million training games (not achievable during one life, sounds more like Humanity total), 4 hours on 5000 TPU = more than 2 years of single TPU = thousands of years on a single CPU?
- sounds like very slow learning; wouldn't other methods perform as well with such computational power? [Benjamin Recht's presentation at DALI, slide 25]

- same with NLP state-of-the-art huge networks (BERT, XLnet, etc.): may cost hundreds of thousands of dollars to train

- Given samples $(x_i, y_i)$,
- estimate $f: x_i \mapsto \, \approx y_i$,
- quantify goodness of output with a "loss function" $\mathrm{Loss}$,
- search for $$\inf_{f \;\in\; \mathrm{predefined\; parameterized\; family}} \sum_{\mathrm{examples}\;i} \mathrm{Loss}(f(x_i), y_i) + \mathrm{Regularizer}(f)$$
- where the regularizer makes sure that $f$ is regular enough (typically: smooth), hoping for generalization properties

- millions of parameters : what about Occam's razor (shortest or simplest explanation is the best) / Minimum Description Length ?
- possible to train networks with millions of parameters without overfitting (still generalizes well, i.e. small gap between train and test error)
- and this with fewer examples than parameters : estimator convergence?
- highly non-convex optimization in a high-dimensional space : thought to be very hard and to be avoided
- add noise in the optimization process, it helps (!)
- train to optimize a criterion (cross-entropy), but evaluate on another one (accuracy)... (because the evaluation criterion is not differentiable or doesn't have good gradient properties)
- common recommendation: when trying to solve a new task: check first that the model is able to overfit (!) (in contrast with picking a regularized search space)

- huge models can do the job (might not overfit):

→ source: Benjamin Recht's slides 9 & 11, respectively from:- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Zhifeng Chen, 2018]
- and [Matrix Factorization Techniques for Recommender Systems, Yehuda Koren, Robert Bell, Chris Volinsky, Computer 2009]

- they can completely overfit too (learn random labels) : shown theoretically & experimentally
- theory: minimal network size for overfitting is very small (given the sample, to fit random labels):

Theorem: There exists a two-layer neural network with ReLU activations and $2n + d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

i.e. whatever the labels to predict are, it is possible to tune this small network to output those labels on those $n$ samples - $\implies$ capacity is not the issue (Rademacher or Vapnik-Chervonenkis), rather control / regularization / telling which functions are preferable

- theory: minimal network size for overfitting is very small (given the sample, to fit random labels):
- how to know when overfitting or not? / how to prevent overfitting? [to be discussed later]
- possible regularizers: weight decay, dropout, data augmentation → not clear, from experiments (not sufficient/necessary even though do help)
- early stopping = regularizer
- SGD, as a optimization process, regularizes also

- NP-hard to compute the functional norm, because NP-hard to estimate whether a neural net is the function 0... (example: if binary inputs/activities, determining whether output is always 0 is a SAT problem)
- but stochastic approximation of the norm = tractable

- robustness to noise in neuron activity $a_i$ $\;\;\Longleftrightarrow\;\;$ requiring $\frac{df}{da_i} \simeq 0$
- $\implies$ not far from an explicit regularizer $\|\nabla_x f\|$

- ensemble point of view: average over probabilistic architectures [Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, Yarin Gal, Zoubin Ghahramani, ICML 2016]
- each drop-out realization leads a different (sub-)network (since half of the nodes are missing)
- optimize the expectation (of the loss) over possible drop-out realizations $\simeq$ ensemble method averaging the output of all possible sub-networks

→ standard Machine Learning process to select the best model among a list of models

Other point of view:

Quoting [Deep Learning book, Ian Goodfellow, Yoshua Bengio, Aaron Courville]:

- no time to get specialized too much (too far from generic initialization) into overfitting minima
- same for weight decay then
- optimization $\neq$ usual gradient descent till convergence, but rather just an optimization *process* (pile of recipes, selected manually by trial and error, to have a positive effect on neural network training)

- Around convergence: if the Hessian $\left(\frac{d^2f}{dw^2}\right)$ is definite positive ("def $>0$"), then SGD noise does not harm:
- def $>0$ $\implies$ convex problem: any noisy step away from the solution will be corrected later as it increases the loss

- but if the Hessian is degenerated (positive non-definite: several 0 eigenvalues, i.e. flat landscape in some directions, i.e. a continuum of global minima):
- in a batch approach, i.e. if minimizing $\sum_{\text{all samples}\; i} \mathrm{Loss}(f(x_i))$: noisy steps going away from an initial global solution will be corrected only if they increase the training error
- noisy steps increasing the test error but not the training error won't be corrected
- $\implies$ training error remains constant but test error increases with time
- $\implies$ SGD overfits after a while

- Consequently:
- make the Hessian become def $>0$ : add a weight regularizer ($\|w\|_2$)
- contrary to: weight normalization (of the connexion matrix between two layers) + RELU $\;\;\implies\;\;$ regularizes but not def $>0$

- swapping neurons in a layer (i.e. changing the ordering) does not change the function computed (permutation invariance) $\implies$ lots of minima... (any local minimum has factorial($N$) copies, i.e. the number of possible permutations of neurons, or, actually, $\prod_l N_l!$ where $N_l$ is the number of neurons in layer $l$)
- or Poggio's bound on the number of global minima using Bezout's theorem for polynomial-activation networks (here global minimum := zero training error $\implies$ 0 of a polynomial) [Theory II: Landscape of the Empirical Risk in Deep Learning, Qianli Liao, Tomaso Poggio, 2017]
- so many parameters $\implies$ local minima = very good (because huge neighborhood: derivative is 0 in many many directions) (different to classical optimization)
- many local minima, but also: even more: saddle points

→ randomly-generated Hessians (as random symmetric matrices) are much more likely to be saddle points (i.e. have eigenvalues of different signs) than local minima (definite positive: all eigenvalues positive) [Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio, NIPS 2014]

- usually under particular assumptions
- examples:
- random initialization + SGD on an overparameterized network : converges (well) not very far from initialization (2-layer network) [Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data, Yuanzhi Li, Yingyu Liang, NIPS 2018]
- good convergence (under assumptions): [Gradient Descent Finds Global Minima of Deep Neural Networks, Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai, 2018]

- Spot a phenomenon known as "double gradient descent" in the case of training without regularization and without early stopping: overfit disappears with many neurons enough!
- Prove theoretically and experimentally that with a high-enough number $n$ of neurons, there is no overfitting problem anymore (overfit decreases in $\frac{1}{\sqrt{n}}$ for $n$ large enough)
- $\implies$ when designing an architecture, and choosing the number of neurons, do not worry that too many neurons might cause overfit or optimization issues : on the opposite, overparameterization helps and reduces such issues (if sufficiently overparameterized). It might slow down computations, but not the quality of the result.

To go further:

[Deep Double Descent: Where Bigger Models and More Data Hurt; Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever, 2019]

- Same phenomenon, studied as a function of the number of training epochs: when training a large architecture (without regularization), overfit is followed by standard test error decrease again
- and studied also a function of the number of samples (dataset size): for some fixed architectures and fixed number of training epochs, increasing the dataset size might lead to bigger overfit!

Quoting [Stanford class on CNN for Visual Recognition] : good summary about neural net complexity and performance :

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.

- $\implies$ don't evaluate the complexity of the network (or its optimization) by its number of neurons (not a good proxy) : larger networks are easier to train (more local minima, but good ones, to the opposite of small architectures)
- then, how?

- no precision needed (in the encoding of parameters)
- high redundancy (different parts of the network may compute similar quantities)
- high redundancy proof with neural network compression: impressive rate (using pruning, quantization, tensor factorization...: compression factor > 100 )

→ thanks to huge compression rates, able to run online video object detection on a smartphone [Bayesian Compression for Deep Learning, Christos Louizos, Karen Ullrich, Max Welling, NIPS 2017]

For more details on how to adapt the minimum description length principle (MDL) to large networks, check the bonus part.

- Occam's razor: shorter explanation = the best one, will generalize better to new examples
- $\implies$ find the shortest program that produces $(y_i)$, being given $(x_i)$
- Solomonoff's prior: proba$(f) = 2^{\,- \,\mathrm{Kolmogorov\; complexity}(f) }$ : defines a prior probability on functions (or programs) $f$ as reflecting its complexity ($\implies$ prefer simple functions)
- usual approximation of Kolmogorov complexity :
- Akaike information criterion (AIC): number $n$ of parameters
- or Bayesian information criterion (BIC): $\; n \log N$, with $N$ = number of samples

- in the Bayesian probabilistic setup: choose a prior over functions, e.g. prior$(f) \propto e^{- \|f\|}$ for some norm, and measure the likelihood of $f$ to produce the right answer: $p(y | f(x))$, then MDL rewrites as the ML setup: $$p(f) = \prod_{\mathrm{samples}\;i} p(y_i|f(x_i)) \times \mathrm{prior}(f) \;\;\text{to be maximized}$$ $$\implies\;\;\;\;\; E(f) = -\log p(f) = \sum_i -\log p(y_i|f(x_i)) \;- \log \mathrm{prior}(f)\;\; \text{to be minimized}$$ $$\implies\;\;\;\;\; E(f) = \sum_i \mathrm{Loss}(f(x_i), y_i) \; + \|f\|\;\; \text{to be minimized}$$

- Let's go back to the original MDL formulation, and derive another, simple expression to estimate neural network complexity, which does not rely on parameter number
- NB: we consider networks with probabilistic outputs, i.e. one does not compute a single value $y_i$ given an input $x_i$, but a distribution $p_\mathrm{predicted}(y_i)$, denoted $p_\mathrm{predicted}(y_i | x_i)$
- We describe here [The Description Length of Deep Learning Models, Leonard Blier, Yann Ollivier, NIPS 2018] : compression, not of the network itself, but of $\left(\left.\mathrm{network}, \mathrm{labels}\; (y_i) \; \right|\; \mathrm{inputs}\; (x_i)\right)$.
- main idea: do not encode the network already trained (millions of weights to encode!), but encode the
*process*of training that network (cheap since input data $(x_i)$ are given), so that the training can be re-performed exactly from that information, and thus the trained network can be recomputed- start with always the same neural network initialization (chosen so that it is short to describe, e.g. the random seed)
- at each training time step $t$ : get new samples $(x_i)$, predict for each of them an associated distribution probability $p_t(y_i|x_i)$, and compute how much it costs to describe the true $y_i$ according to that predicted distribution (better prediction means smaller cost): $-\log p_t(\mathrm{true\; label}\; y_i | x_i)$

→ NB: this quantity derives from information theory; if not familiar with it, here's a course about information theory oriented towards machine learning - total description length of this training algorithm: neglectible constant $+ \sum_t \sum_{i \in \mathrm{batch\; at\; time}\; t} - \log p_t(y_i | x_i)$ where $p_t$ is computed by the network at training step $t$ (when not knowing the $y_i$ of its new batch yet).
- note that this quantity is related to the performance of the network, its trainability, but not directly to the number of its parameters
- MDL is still compatible with deep learning!

- train many models (different architectures, etc.) and pick the one with the lowest description length: it will generalize better, according to the MDL principle / Solomonoff prior
- can spot, in the description cost, the part due to parameters (weights) [information stored in the network], from the part due to uncertainty in output :
- compute the description length as if one had known the trained network from the beginning: $\sum_{\mathrm{all\;samples} \;i\; \mathrm{from\;all\;batches}} -\log p_\mathrm{trained}(y_i|x_i)$
- compute the difference with the real total description length above : the difference expresses how much information is stored in the weights of the trained network, i.e. how much the network learned

Back to the main page of the course