$\newcommand{\E}{\mathbb{E}}$ $\newcommand{\R}{\mathbb{R}}$

Deep Learning in Practice

Chapter 2: Architectures



Overview:

I - Architectures as priors on function space, initializations as random nonlinear projections
II - Architectures

I - Architectures as priors on function space, initializations as random nonlinear projections

Change of design paradigm

Architecture = prior on the function


Random initialization: random, but drawn from a carefully chosen law that induces good functional properties (see the variance sketch after this list)

Designing architectures that are easy to train
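
To make the point about initialization laws concrete, here is a minimal sketch (not from the course; assumes PyTorch and hypothetical layer sizes): pushing random inputs through a deep stack of ReLU layers, a naive weight scale makes activations vanish, while a variance-preserving law such as He initialization keeps them at a usable scale.

    import torch

    def relu_stack_variance(std, depth=50, width=512, n_samples=1024):
        # Push random inputs through `depth` ReLU layers whose weights are
        # drawn i.i.d. from N(0, std^2); report the final activation variance.
        x = torch.randn(n_samples, width)
        for _ in range(depth):
            w = torch.randn(width, width) * std
            x = torch.relu(x @ w)
        return x.var().item()

    width = 512
    print("naive, std = 0.01          :", relu_stack_variance(0.01))                  # collapses to ~0
    print("He,    std = sqrt(2/width) :", relu_stack_variance((2.0 / width) ** 0.5))  # stays O(1)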



II - Architectures


Deep learning vs classical ML joke

NB: this is about current most popular architectures, not to be taken as an exhaustive, immutable list
$ \implies $ "deep learning" is moving towards general "differentiable programming", with more flexible architectures/approaches every year: any computational graph provided all operations are differentiable.

Architecture zoo reminder (CNN, auto-encoder, LSTM, adversarial...)

CNN: reducing the number of parameters by sharing local filters + hierarchical model ($ \implies $ invariance to translation, + much greater generalization power [cf Chapter 1]; see the parameter-count sketch after this list)
Auto-encoder, VAE, GAN :
Recurrent networks
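
The parameter saving from weight sharing can be made concrete with a minimal sketch (assumed PyTorch, hypothetical 32x32x3 input): a shared 3x3 filter bank uses a few hundred parameters, where a dense layer connecting the same input and output resolutions would use tens of millions.

    import torch.nn as nn

    h, w, c_in, c_out = 32, 32, 3, 16    # hypothetical image and feature sizes

    conv  = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)  # one shared 3x3 filter bank
    dense = nn.Linear(h * w * c_in, h * w * c_out)            # one weight per (input, output) pair

    n_params = lambda m: sum(p.numel() for p in m.parameters())
    print("conv :", n_params(conv))    # 3*3*3*16 + 16 = 448
    print("dense:", n_params(dense))   # 3072*16384 + 16384, about 5e7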

Dealing with scale & resolution

Classification/generation of high-resolution images: pyramidal approaches (e.g. any standard convolutional network for ImageNet / a reverse pyramid for image generation)
Same problem, but with full resolution required for input and output simultaneously (example: image segmentation)
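
A minimal encoder-decoder sketch for the full-resolution case (assumed PyTorch, hypothetical channel counts, no skip connections): a downsampling pyramid followed by an upsampling one, so the output keeps the input resolution as needed for segmentation.

    import torch
    import torch.nn as nn

    class TinySegNet(nn.Module):
        def __init__(self, in_ch=3, n_classes=5):
            super().__init__()
            self.down = nn.Sequential(                       # resolution /4, richer features
                nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.up = nn.Sequential(                         # back to full resolution
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, n_classes, 4, stride=2, padding=1),
            )

        def forward(self, x):
            return self.up(self.down(x))                     # per-pixel class scores

    x = torch.randn(1, 3, 64, 64)
    print(TinySegNet()(x).shape)                             # torch.Size([1, 5, 64, 64])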

Dealing with depth and mixing blocks

Training deep networks: most are not really deep in terms of distance to the loss information (except with orthonormal matrices...) (cf section above)
Inception [v1: Going deeper with convolutions, C Szegedy et al, CVPR 2015]
DenseNet [Densely Connected Convolutional Networks, Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, CVPR 2017]
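
A minimal DenseNet-style block, sketched under the assumption of PyTorch and hypothetical channel counts: each layer receives the concatenation of all previous feature maps, which keeps the path from the loss to every layer short.

    import torch
    import torch.nn as nn

    class TinyDenseBlock(nn.Module):
        def __init__(self, in_ch=8, growth=4, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
                for i in range(n_layers)
            )

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                out = torch.relu(layer(torch.cat(features, dim=1)))
                features.append(out)              # later layers reuse all earlier maps
            return torch.cat(features, dim=1)

    x = torch.randn(1, 8, 16, 16)
    print(TinyDenseBlock()(x).shape)              # torch.Size([1, 20, 16, 16])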

Examples (case study)

What architecture would you propose for:

Attention mechanisms, R-CNN

Basics of attention:

Case of many variables influencing each other [Attention is all you need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, NIPS 2017] (see the scaled dot-product sketch after this list)
Application of attention to features / blocks: [Squeeze-and-Excitation Networks, Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu, CVPR 2018, https://arxiv.org/abs/1709.01507]
R-CNN: Region-based CNN
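
The core operation in the attention papers above is scaled dot-product attention; here is a minimal single-head sketch (assumed PyTorch, no masking and no learned projections):

    import math
    import torch

    def attention(Q, K, V):
        # Q, K, V: (batch, seq_len, d). Each output position is a weighted
        # average of the values V, with weights given by query/key similarity.
        d = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d)    # (batch, seq, seq)
        weights = torch.softmax(scores, dim=-1)            # rows sum to 1
        return weights @ V                                 # (batch, seq, d)

    Q = K = V = torch.randn(2, 5, 16)                      # self-attention case
    print(attention(Q, K, V).shape)                        # torch.Size([2, 5, 16])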

"Memory"

Not really working yet.

GraphCNN

Principles:
Related literature:

Other / advanced

Relation Networks
PixelRNN [Pixel Recurrent Neural Networks, Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, ICML 2016] (see the masked-convolution sketch below)
PixelRNN principle
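
The autoregressive constraint behind PixelRNN (each pixel conditioned only on the pixels above it and to its left) can be sketched with a masked convolution, as in the PixelCNN variant from the same paper; the code below is an illustrative PyTorch assumption, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MaskedConv2d(nn.Conv2d):
        # Convolution whose kernel is zeroed on the current pixel and on every
        # pixel to its right or below, enforcing the autoregressive ordering.
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            k = self.kernel_size[0]
            mask = torch.ones_like(self.weight)
            mask[:, :, k // 2, k // 2:] = 0     # current pixel and pixels to its right
            mask[:, :, k // 2 + 1:, :] = 0      # all rows below
            self.register_buffer("mask", mask)

        def forward(self, x):
            return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                        self.stride, self.padding)

    layer = MaskedConv2d(1, 8, kernel_size=5, padding=2)
    print(layer(torch.randn(1, 1, 28, 28)).shape)   # torch.Size([1, 8, 28, 28])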


Wavenet : to deal with time in a hierarchical manner [WaveNet: A Generative Model for Raw Audio, Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, 2016]
→ stack of dilated causal convolution layers (+ ResNet-style residual and skip connections); see the dilated-convolution sketch below

Wavenet architecture
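
A minimal sketch of the dilated causal convolution stack (assumed PyTorch; gating, residual and skip connections omitted): dilation doubles at each layer, so the receptive field grows exponentially with depth.

    import torch
    import torch.nn as nn

    class CausalConv1d(nn.Conv1d):
        # 1D convolution that only looks at past samples (left padding only).
        def __init__(self, in_ch, out_ch, kernel_size, dilation):
            super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
            self.left_pad = (kernel_size - 1) * dilation

        def forward(self, x):
            x = nn.functional.pad(x, (self.left_pad, 0))   # pad on the left only
            return super().forward(x)

    layers = nn.Sequential(*[
        nn.Sequential(CausalConv1d(16, 16, kernel_size=2, dilation=2 ** i), nn.ReLU())
        for i in range(6)                                  # receptive field: 2^6 = 64 samples
    ])
    x = torch.randn(1, 16, 1000)                           # (batch, channels, time)
    print(layers(x).shape)                                 # torch.Size([1, 16, 1000])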



[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019]
→ stacked attention modules in an RNN (so: when a new input arrives, each of them is applied once, in series), with attention performed over the previous layer at all previous time steps.

NB: all-LSTM / all-conv / all-attention: each of these works (with proper design, initialization and training), and better than the other, previous methods (according to the respective papers)
→ e.g.: [Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, NIPS 2017]

Executing $n$ steps after each new input before reading the next one (when applying an RNN), with variable $n$ [Adaptive Computation Time for Recurrent Neural Networks, Alex Graves, 2016, unpublished]

Social LSTM [Social LSTM: Human Trajectory Prediction in Crowded Spaces, Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, Silvio Savarese, CVPR 2016]

Other important pieces of design

Loss design:
Activation functions:
Layer size, feature size (number of neurons and/or features)

Opening

Hyperparameter tuning (architecture + optimization parameters)
$\implies$ auto-DL (Chapter 8)








