$\newcommand{\E}{\mathbb{E}}$ $\newcommand{\R}{\mathbb{R}}$

Deep Learning in Practice

Chapter 1: Deep learning vs. classical machine learning and optimization



I - Going Deep or not?
II - Gap between classical Machine Learning and Deep Learning
III - Palliatives for regularization
IV - Optimization landscape / properties [small break]
V - ML fundamentals (MDL) are still there!
VI - Architectures as priors on function space, initializations as random nonlinear projections

Additional reference

Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks

I - Going Deep or not?

No guarantee

There is no guarantee beforehand that training (with a given architecture and a given criterion to optimize) will lead to a good solution

Uselessness of universal approximation theorems (expressive power)

Depth simplifies the approximation/estimation task:

Why deep: examples of depth vs. layer size compromises with explicit bounds

Does it work? When?

Examples of successes and failures of deep learning vs. classical techniques (random forests)

II - Gap between classical Machine Learning and Deep Learning

Forgotten Machine Learning basics (Minimum Description Length principle, regularizers, objective function different from evaluation criterion) and incidental palliatives (drop-out)

Reminder: ML setup

Some ML / optimization basics no longer seem to hold (no overfitting with millions of parameters?)

Deep learning vs classical ML joke

A closer look at overfitting

[Understanding deep learning requires rethinking generalization, Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, ICLR 2017] & presentation by Benjamin Recht at DALI

III - Palliatives for regularization

What about adding functional regularizers?

[Function Norms and Regularization in Deep Networks, Amal Rannen Triki, Maxim Berman, Matthew Blaschko]


Drop-out

At each time step, randomly select half of the neurons and temporarily drop them (replace their output with 0)
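
As a minimal sketch of the drop-out step described above, assuming NumPy and a layer whose activations are stored in an array: each neuron is dropped independently with probability 0.5, so half of them are dropped on average (not exactly half). The function name `dropout` and the rescaling by `1 / keep_prob` (the usual "inverted drop-out" convention, which keeps the expected activation unchanged) are assumptions added for illustration, not part of the notes.

```python
import numpy as np

def dropout(activations, rng, drop_prob=0.5, training=True):
    """Randomly zero out neurons during training (inverted drop-out sketch)."""
    if not training or drop_prob == 0.0:
        # At test time all neurons are kept and nothing is rescaled.
        return activations
    keep_prob = 1.0 - drop_prob
    # Independent Bernoulli mask: True means the neuron is kept at this step.
    mask = rng.random(activations.shape) < keep_prob
    # Dropped neurons output 0; kept ones are rescaled (assumed convention).
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 6))                       # toy activations of one layer
h_train = dropout(h, rng)                 # a fresh mask is drawn at every call
h_test = dropout(h, rng, training=False)  # unchanged at test time
```

A new mask is drawn at every training step, so the dropped neurons are only dropped temporarily.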

Early stopping

i.e. when training, store a copy $M(t)$ of the neural network $M$ at each time step $t$, and in the end pick the one $M(t')$ that has the lowest validation error.
→ standard Machine Learning process to select the best model among a list of models
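
The procedure above can be sketched as follows, assuming a toy linear model trained by gradient descent in NumPy in place of a neural network. The data, the learning rate and the names `snapshots`, `val_errors` and `t_best` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): noisy linear targets, few training points.
n_train, n_val, dim = 20, 200, 15
w_true = rng.normal(size=dim)
X_train = rng.normal(size=(n_train, dim))
X_val = rng.normal(size=(n_val, dim))
y_train = X_train @ w_true + rng.normal(scale=2.0, size=n_train)
y_val = X_val @ w_true + rng.normal(scale=2.0, size=n_val)

def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(dim)   # the model M, starting from theta_0 = 0
lr = 0.01
snapshots, val_errors = [], []
for t in range(500):
    # One gradient descent step on the training loss.
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / n_train
    w = w - lr * grad
    snapshots.append(w.copy())               # store the copy M(t)
    val_errors.append(mse(w, X_val, y_val))  # validation error of M(t)

# Pick the snapshot M(t') with the lowest validation error.
t_best = int(np.argmin(val_errors))
w_best = snapshots[t_best]
```

Storing every snapshot is only practical for small models; a common variant keeps just the best copy seen so far.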

Other point of view:
Quoting [Deep Learning book, Ian Goodfellow, Yoshua Bengio, Aaron Courville]:
Bishop [Regularization and Complexity Control in Feed-forward Networks, Christopher Bishop, ICANN 1995] and Sjöberg and Ljung [Overtraining, Regularization, and Searching for Minimum in Neural Networks, J. Sjöberg, L. Ljung, International Journal of Control 1995] argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value $\theta_0$.

Optimization noise acts as a regularizer

[Tomaso Poggio, various publications]

Overparameterization helps (hot topic)

To understand this, we need a more detailed analysis of the optimization landscape

IV - Optimization landscape / properties [small break]

Local minima and saddle points

Many works on convergence:

Back to: Overparameterization helps

[Scaling description of generalization with number of parameters in deep learning; Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, Matthieu Wyart, 2019]
To go further:
[Deep Double Descent: Where Bigger Models and More Data Hurt; Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever, 2019]
Quoting [Stanford class on CNN for Visual Recognition], a good summary of neural network complexity and performance:
Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.

V - ML fundamentals (MDL) are still there!

High redundancy inside neural networks

Is the MDL principle lost? The number of parameters is huge, but:
For more details on how to adapt the minimum description length principle (MDL) to large networks, check the bonus part.

VI - Architectures as priors on function space, initializations as random nonlinear projections

Cf Chapter 2 on architectures.

Additional reference:

A good read / source of references: "Deep Learning" book by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Bonus: Adapting the Minimum Description Length principle (MDL) to neural networks

Reminder: Minimum Description Length (MDL)

Origin of the ML setup (theoretical justification, from information theory)
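
As a hedged reminder of the standard two-part formulation (the notation $L(\cdot)$ for a code length in bits, $D$ for the data and $\theta$ for the model parameters is an assumption, not taken from the notes): describing the data with a model costs the description of the model plus the description of the data given the model, and MDL picks the model that minimizes the total:

$$\theta^* = \arg\min_\theta \; \big[ L(\theta) + L(D \mid \theta) \big], \qquad L(D \mid \theta) = -\log_2 p(D \mid \theta)$$

The second term is the Shannon code length of the data under the model, i.e. a negative log-likelihood (the usual loss), while the first term plays the role of a regularizer. This is one way to read the ML setup "loss + regularizer" as coming from information theory.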

Back to the original formulation of MDL

Actually, AIC and BIC are just approximations of MDL, not valid here
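
For reference, the usual definitions of these two criteria, with $k$ the number of parameters, $n$ the number of samples and $\hat{L}$ the maximized likelihood (standard notation, assumed here since the notes do not fix it):

$$\mathrm{AIC} = 2k - 2 \ln \hat{L}, \qquad \mathrm{BIC} = k \ln n - 2 \ln \hat{L}$$

Both penalize the raw parameter count $k$. They are derived in an asymptotic regime where $n$ is large compared to $k$, which is presumably why they do not apply as such to networks with millions of parameters.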

