$\newcommand{\E}{\mathbb{E}}$ $\newcommand{\R}{\mathbb{R}}$

Deep Learning in Practice

Chapter 1: Introduction: gap between practice and theory



Overview:

I - Going Deep or not?
II - Gap between classical Machine Learning and Deep Learning
III - Palliatives for regularization
IV - Optimization landscape / properties [small break]
V - ML fundamentals (MDL) are still there!
VI - Architectures as priors on function space, initializations as random nonlinear projections

Additional reference


I - Going Deep or not?

No guarantee

No guarantee (beforehand) that training a given architecture with a given criterion to optimize will lead to a good solution

Uselessness of universal approximation theorems (expressive power): they say nothing about how large the network must be, nor about whether training will actually find a good approximation


Depth simplifies the approximation/estimation task:

Why go deep: examples of depth vs. layer-size trade-offs, with explicit bounds
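One classical illustration of such a trade-off (a standard depth-separation example, stated here for reference): let $g$ be the "tent map" on $[0,1]$,
$$ g(x) = 2\,\mathrm{relu}(x) - 4\,\mathrm{relu}\!\left(x - \tfrac{1}{2}\right), $$
which a single layer with 2 ReLU units computes exactly. Its $k$-fold composition $g^{\circ k}$ is a sawtooth with $2^k$ linear pieces on $[0,1]$, so a depth-$k$ network with only $2k$ hidden units represents it. A one-hidden-layer ReLU network with $w$ units is piecewise linear with at most $w+1$ pieces, hence needs $w \ge 2^k - 1$ units to represent the same function: depth buys an exponential saving in layer size.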

Does it work? When?

Examples of successes and failures of deep learning vs. classical techniques (random forests)
Petaflops

Gap between classical Machine Learning and Deep Learning

Forgotten Machine Learning basics (Minimum Description Length principle, regularizers, objective function different from evaluation criterion) and incidental palliatives (drop-out)

Reminder: ML setup
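In short (the standard supervised setup, for reference): given samples $(x_i, y_i)_{i=1}^n$ drawn i.i.d. from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, choose a predictor $f_\theta$ to make the expected risk $\E_{(x,y)\sim\mathcal{D}}\,\ell(f_\theta(x), y)$ small; since $\mathcal{D}$ is unknown, one minimizes a regularized empirical risk instead,
$$ \hat\theta \in \arg\min_\theta \; \frac{1}{n} \sum_{i=1}^n \ell\big(f_\theta(x_i), y_i\big) + \lambda\, \Omega(\theta), $$
and evaluates on held-out data, possibly with a criterion different from the training loss $\ell$.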


Reminder: Minimum Description Length (MDL)

Origin of the ML setup (theoretical justification, from information theory)
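In its two-part form (standard statement, for reference): among candidate models $M$, prefer the one minimizing the total description length of the data,
$$ L(M) \;+\; L(D \mid M), $$
i.e. the bits needed to describe the model plus the bits needed to describe the data given the model. Since optimal code lengths satisfy $L(D \mid M) \approx -\log_2 P(D \mid M)$, minimizing this total is the information-theoretic counterpart of a regularized likelihood: the regularizer plays the role of $-\log_2 P(M)$, the cost of describing the model.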

Some ML / optimization basics seem to no longer hold (no overfitting with millions of parameters?)


Deep learning vs. classical ML joke

A closer look at overfitting

[Understanding deep learning requires rethinking generalization, Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, ICLR 2017] & presentation by Benjamin Recht at DALI
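A minimal sketch of that kind of experiment, in PyTorch (the dataset here is synthetic random data rather than CIFAR-10, and all sizes and hyperparameters are arbitrary choices for illustration): an over-parameterized MLP drives the training error to zero even though the labels are pure noise.

# Sketch: fit purely random labels with an over-parameterized MLP
# (Zhang-et-al.-style memorization experiment; sizes/hyperparameters arbitrary).
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, n_classes = 1000, 32, 10
X = torch.randn(n, d)                    # random inputs
y = torch.randint(0, n_classes, (n,))    # labels carry no information

model = nn.Sequential(
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_classes),
)                                        # ~300k parameters >> 1000 samples
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):                 # plain full-batch training
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training loss {loss.item():.4f}, training accuracy {acc:.2f}")
# Training accuracy approaches 1: capacity alone cannot explain why the same
# networks generalize when the labels are real.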

Palliatives for regularization

Drop-out

At each training step, randomly select a fraction of the neurons (typically half) and temporarily drop them (replace their activations with 0); at test time all neurons are kept, with activations rescaled accordingly
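A minimal sketch of (inverted) dropout on a batch of activations, assuming a drop probability p = 0.5 as above; in practice one uses the framework's version directly (e.g. torch.nn.Dropout).

# Sketch: inverted dropout (p = probability of dropping a neuron).
import torch

def dropout(h, p=0.5, training=True):
    if not training or p == 0.0:
        return h                          # test time: keep all neurons, no rescaling needed
    mask = (torch.rand_like(h) > p).float()
    return h * mask / (1.0 - p)           # rescale so the expected activation is unchanged

h = torch.randn(4, 8)
print(dropout(h))                         # about half of the entries are zeroed during training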

What about adding functional regularizers?

[Function Norms and Regularization in Deep Networks, Amal Rannen Triki, Maxim Berman, Matthew Blaschko]
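A rough sketch of the general idea only (not the exact estimator of the cited paper): penalize a Monte-Carlo estimate of a function norm such as $\E_x \|f_\theta(x)\|^2$, computed on sampled inputs, on top of the data-fitting loss. The sampling distribution and the penalty weight below are arbitrary choices.

# Sketch: sampled function-norm penalty added to the usual training loss
# (illustration of the idea; sampling distribution and weight are arbitrary).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def function_norm_penalty(model, n_samples=128, dim=16):
    x = torch.randn(n_samples, dim)       # Monte-Carlo samples of the input space
    return model(x).pow(2).mean()         # estimate of E_x ||f(x)||^2

x, y = torch.randn(32, 16), torch.randn(32, 1)    # dummy training batch
loss = nn.functional.mse_loss(model(x), y) + 1e-2 * function_norm_penalty(model)
opt.zero_grad()
loss.backward()
opt.step()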

Early stopping

I.e., during training, store a copy $M(t)$ of the neural network $M$ at each time step $t$, and in the end pick the one $M(t')$ with the lowest validation error (a minimal sketch follows the quote below).
Quoting [Deep Learning book, Ian Goodfellow, Yoshua Bengio, Aaron Courville]:
Bishop [Regularization and Complexity Control in Feed-forward Networks, Christopher Bishop, ICANN 1995] and Sjoberg and Ljung [Overtraining, Regularization, and Searching for Minimum in Neural Networks, J. Sjöberg, L. Ljung, International Journal of Control 1995] argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value $\theta_0$.
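A minimal sketch of the bookkeeping, assuming a PyTorch-style model with state_dict(); the patience threshold and the helper names (train_step, validation_error) are placeholders for illustration.

# Sketch: early stopping = keep the snapshot with the lowest validation error.
import copy

def train_with_early_stopping(model, train_step, validation_error, n_steps, patience=10):
    best_err, best_state, since_best = float("inf"), None, 0
    for t in range(n_steps):
        train_step(model)                     # one optimization step (user-provided)
        err = validation_error(model)         # error on a held-out validation set
        if err < best_err:
            best_err, since_best = err, 0
            best_state = copy.deepcopy(model.state_dict())   # this is M(t')
        else:
            since_best += 1
            if since_best >= patience:        # optionally stop once it stops improving
                break
    model.load_state_dict(best_state)         # return the best snapshot, not the last one
    return model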

Optimization noise acts as a regularizer

[Tomaso Poggio, various publications]

Overparameterization helps (hot topic)

To understand this, we need a more detailed analysis of the optimization landscape

Optimization landscape / properties [small break]

Local minima and saddle points
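For reference: at a critical point $\theta^*$ (i.e. $\nabla L(\theta^*) = 0$), the signs of the eigenvalues of the Hessian $\nabla^2 L(\theta^*)$ decide the local picture: all positive gives a local minimum, all negative a local maximum, mixed signs a saddle point. Random-matrix-style analyses suggest that in high dimension, critical points with high loss overwhelmingly have some negative curvature directions, so the practical difficulty is escaping saddle points and the plateaus around them rather than bad local minima.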


Many works on convergence:


Back to: Overparameterization helps

[Scaling description of generalization with number of parameters in deep learning; Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, Matthieu Wyart]
Quoting [Stanford class on CNN for Visual Recognition], a good summary of neural-net complexity and performance:
Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.
The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.


ML fundamentals (MDL) are still there!

High redundancy inside neural networks

Is the MDL principle lost? The number of parameters is huge, but:
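As a rough illustration of this redundancy (a toy sketch, not a benchmark; sizes and the pruning threshold are arbitrary): on a trained network, a sizeable fraction of the weights can usually be zeroed out with limited impact, so the effective description length is far below the raw parameter count.

# Sketch: magnitude-prune a (toy) trained network and check the loss degradation
# (illustration of weight redundancy; sizes and threshold are arbitrary).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 16)
y = torch.sin(X.sum(dim=1, keepdim=True))             # toy regression target

model = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print("loss before pruning:", loss.item())

with torch.no_grad():
    for p in model.parameters():
        if p.dim() > 1:                                # prune weight matrices only
            thr = p.abs().flatten().quantile(0.5)      # zero out the smallest half
            p.mul_((p.abs() >= thr).float())
    print("loss after pruning :", nn.functional.mse_loss(model(X), y).item())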

Back to the original formulation of MDL

Actually, AIC and BIC are only (asymptotic) approximations of MDL, and their assumptions do not hold here
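For reference, the standard formulas: with $k$ parameters, $n$ samples and maximized likelihood $\hat L$,
$$ \mathrm{AIC} = 2k - 2\ln \hat L, \qquad \mathrm{BIC} = k \ln n - 2 \ln \hat L, $$
i.e. both charge a model-complexity cost proportional to the raw parameter count $k$. These are asymptotic approximations derived for identifiable models with $n \gg k$; for heavily over-parameterized, redundant networks those assumptions break down, so one has to go back to actual description lengths rather than counting parameters.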

Architectures as priors on function space, initializations as random nonlinear projections

Cf Chapter 2 on architectures.

Additional reference:

A good read / source of references: "Deep Learning" book by Ian Goodfellow, Yoshua Bengio and Aaron Courville







