I - Architectures as priors on function space, initializations as random nonlinear projections

- Change of design paradigm
- Architecture = prior on the function
- Random initialization: random but according to a chosen law that induces good functional properties
- Designing architectures easy to train

- Architecture zoo reminder (CNN, auto-encoder, LSTM, adversarial...)
- Dealing with scale & resolution
- Dealing with depth and mixing blocks
- Examples (case study)
- Attention mechanisms, R-CNN
- "Memory"
- GraphCNN
- Others / advanced
- Other important pieces of design
- Opening

- classical ML: design features by hand; deep learning: meta-design of features, i.e. design architectures that are likely to produce features similar to the ones you would have designed by hand
- an architecture = a parameterized family (by the neural network's weights)
- still a similar optimization problem (find the best parameterized function), just, in a much wider space (many more parameters)

- prior :
- as a constraint: what is expressible or not with this architecture?
- but most networks already have huge capacity (expressive power) $ \implies $ expressivity alone is most often not an informative criterion
- $ \implies $ probabilistic prior: what is easy to reach or not?

- good architecture : with random weights, already good features (or not far) [random = according to some law, see later]
- in classical ML, much work on random features + an SVM on top (usually from a kernel point of view); random projections

- in some cases at least (a-few-layer-deep networks), most of the performance is due to the architecture, not to the training: fitting just a linear classifier on top of a network with random weights barely decreases accuracy, compared to fully training the same network: [On Random Weights and Unsupervised Feature Learning, Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng, ICML 2011] (old: 5-layer LeNet architecture; to be confirmed for deeper architectures such as VGG)
- $ \implies $ training the whole network vs. keeping random "non-linear projections" can be seen as a choice of optimization order or speed (train the last layer first)
- Extreme Learning Machines (ELM) = learn only the last layer (not proven to work on "very" deep architectures such as modern networks)
- NB: layers of networks with random weights might keep more information about the input than ones trained for classification (as not all of the information is useful for classification). Visualization of the quality of random weights in known architectures by reconstructing the input, from the k-th layer activations [VGG with random weights]: no big information loss: [A Powerful Generative Model Using Random Weights for the Deep Image Representation, Kun He, Yan Wang, John Hopcroft, NIPS 2016]
- To go further on information flow inside neural networks: see Information Bottleneck (chapter on Visualization)

- On deep architectures: learn a small part of the weights only : [Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing, Amir Rosenfeld, John K. Tsotsos]
- learn only a fraction of the features of each layer = sufficient to get good results (not very surprising)
- if one can learn only one layer, choose the middle one
- if you use random features: batch-norm is important (of course: the features should be "normalized" somehow at some point instead of making the function explode or vanish; cf next section)

- bias of the architecture (i.e. probability distribution over the possible functions to learn):
- example: Fourier spectrum bias [On the spectral bias of neural networks, Rahaman & al, NIPS 2018] : lower frequencies (i.e. big objects) are learned first when training typical (hierarchical) CNNs.

- avoid exploding or vanishing gradient (e.g.: activities globally multiplied by a constant factor $\gamma$ > 1 or < 1 at each layer: would obtain a factor $\gamma^L \gg 1$ or $\ll 1$ after $L$ layers)
- Xavier Glorot initialization, or similar : [Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot, Yoshua Bengio, AISTATS 2010]
- multiplicative weights: uniform over $ \left[- \frac{1}{\sqrt d}, + \frac{1}{\sqrt d} \right] $, or Gaussian $ \mathcal{N}(0, \sigma^2 = \frac{1}{d}) $ where $d$ = number of inputs of the neuron
- justification: if weights $w_i$ are i.i.d. of mean 0 and standard deviation $\frac{1}{\sqrt{d}}$ and inputs $x_i$ are i.i.d. of mean 0 and variance 1, the neuron will compute $\sum_i w_i x_i$, which will still be of mean $\E_x \E_w \sum_i w_i x_i = \E_w \sum_i w_i \times 0 = 0$ and of variance $\E_x \E_w \sum_i w_i^2 x_i^2 = \sum_i \E_w w_i^2 \E_x x_i^2 = \sum_{i=1}^d \frac{1}{d} \times 1 = 1$. Thus, the output of the neuron will follow the same law as its inputs (mean 0, variance 1). And consequently, if one stacks several such layers, the output of the network will still be of mean 0 and variance 1.
- multiply by an activation-function-dependent factor if needed (eg: ReLU divides variance by 2, since it is 0 over half of the input domain; so, correct by $\times \sqrt 2$)
- biases: 0
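This variance-preservation argument can be checked numerically; the sketch below (arbitrary width, depth and sample count) compares Glorot-scaled linear layers with unscaled ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 256, 20                        # layer width and depth (arbitrary)
x = rng.standard_normal((d, 10000))   # inputs: mean 0, variance 1

h_glorot, h_naive = x, x
for _ in range(L):
    W = rng.standard_normal((d, d))
    h_glorot = (W / np.sqrt(d)) @ h_glorot  # Glorot-style: weight std 1/sqrt(d)
    h_naive = W @ h_naive                   # unscaled: variance multiplied by d per layer

print(h_glorot.var())  # stays close to 1
print(h_naive.var())   # of order d^L: exploded
```

With ReLU activations between the layers, the same experiment would need the extra $\sqrt{2}$ correction factor mentioned above.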

- [Which neural net architectures give rise to exploding and vanishing gradients? NIPS 2018, Boris Hanin]
- previous paragraph: ensures that, at initialization, $f(x)$ is in a reasonable range (notations: $x$ = input, $f$ = function computed by the neural network)
- here: check also the Jacobian $\frac{df}{dx}$ at initialization
- turns out that, with $n_l$ = "number of neurons in layer $l$":
- variance$\left(\frac{df}{dx}\right)$ is fixed: $\frac{1}{n_0}$
- var$\left( \left(\frac{df}{dx}\right)^2\right)$, which involves the fourth moment $\E\left[ \left(\frac{df}{dx}\right)^4 \right]$, is lower- and upper-bounded by terms in $e^{\sum_{\mathrm{layers} \, l} \; \frac{1}{n_l}}$

$ \implies $ $\sum \frac{1}{n_l}$ is an important quantity

$ \implies $ avoid many thin layers; if on a neuron budget, choose equal size for all layers
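A tiny illustration of this recommendation (the layer sizes below are made up): for a fixed neuron budget, $\sum_l \frac{1}{n_l}$ is minimized when all layers have equal size (AM–HM inequality), and a single thin layer dominates the sum.

```python
# Fixed budget of 300 neurons split across 3 layers (hypothetical numbers).
def score(sizes):
    """The quantity sum_l 1/n_l that controls gradient fluctuations."""
    return sum(1.0 / n for n in sizes)

equal = [100, 100, 100]
uneven = [10, 145, 145]   # same total budget, but one thin layer

print(score(equal))   # ≈ 0.03
print(score(uneven))  # ≈ 0.114: dominated by the thin layer
```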

- training "deep" networks: make them actually not that deep in terms of distance between any weight to tune and the loss information (measured in number of neurons to cross in the computational graph), for easier information communication through backpropagation
- ResNet architectures (see next chapter)
- or add intermediate losses (idem)
- only exception: using orthonormal weight matrices

- we saw that overparameterized networks (i.e. large layers) seem to be easier to train
- is depth required? Not an easy question. [Do Deep Nets Really Need to be Deep?, Lei Jimmy Ba, Rich Caruana, NIPS 2014]: training a shallow network directly on the data doesn't work well; training a deep one does; but training a shallow network to mimic the deep one works, possibly with no more parameters than the deep one
- $ \implies $ a "good architecture" is one "more likely to reach the right function"; better future optimizers might make smaller architectures more attractive

NB: this is about current most popular architectures, not to be taken as an exhaustive, immutable list

$ \implies $ "deep learning" is moving towards general "differentiable programming", with more flexible architectures/approaches every year: any computational graph provided all operations are differentiable.

- conv layers need to be interleaved with "zooming-out" operations (such as max-pooling) to obtain a wide enough receptive field (otherwise the network won't see the whole object)
- typical conv block: conv ReLU conv ReLU max-pool with conv 3x3 or so
- NB: do not use large filters: better to rewrite a 15x15 filter as a hierarchical series of 3x3 filters: though the expressivity is similar, the induced probabilities are different, e.g. the typical Fourier spectrum is different
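The equivalence of receptive fields can be checked with a one-line recurrence (stride-1 convolutions assumed): each $k \times k$ layer grows the receptive field by $k-1$.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer adds (k - 1) to the field."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([15]))     # 15
print(receptive_field([3] * 7))  # 15: seven 3x3 layers match one 15x15 filter
```

Note also the parameter count: per channel pair, seven 3x3 filters use $7 \times 9 = 63$ weights, versus $225$ for a single 15x15 filter.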

- adversarial approach $ \implies $ no need to model the task anymore:

The generated image will be a good face if...

- ... it has two eyes, of such color, size, one nose, etc.

vs.

- ... the discriminator doesn't make the difference with real data.
- cycle-GAN : when learning mappings between 2 domains (e.g., images of 2 different styles), ask that the two mappings (style 1 $\mapsto$ style 2, and style 2 $\mapsto$ style 1) be consistent (i.e. composition close to identity) [Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros]

- basic RNN : $h_{t+1} = f( h_t, x_t )\;\;$ (notations: $x_t$ = input at time $t$, $\;\;h_t$ = internal state at time $t$)
- issue: memory? (keep a description of the context: here $h_t$ is completely lost at time $t+1$!)
- get "leaky": progressive update of the memory: $h_{t+1} = h_t + f(h_t, x_t)$
- get "gated": i.e. make the update amount dependent on the context (cf "attention" later): $h_{t+1} = \alpha h_t + (1-\alpha) f(h_t, x_t)$ with $\alpha = g(h_t, x_t) \in [0,1]$ ($\alpha$ is named the *forget gate*)
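The three update rules above can be sketched on a toy scalar state; $f$ and $g$ below are hypothetical hand-picked functions, not trained ones:

```python
import numpy as np

def f(h, x):
    """Candidate update (hypothetical choice of function)."""
    return np.tanh(0.5 * h + x)

def g(h, x):
    """Gate value in [0, 1] (hypothetical choice: a sigmoid)."""
    return 1.0 / (1.0 + np.exp(-(h + x)))

h_basic = h_leaky = h_gated = 0.0
for x in [1.0, -0.5, 0.3]:
    h_basic = f(h_basic, x)                        # basic RNN: state overwritten
    h_leaky = h_leaky + f(h_leaky, x)              # leaky: progressive update
    a = g(h_gated, x)                              # forget gate
    h_gated = a * h_gated + (1 - a) * f(h_gated, x)

print(h_basic, h_leaky, h_gated)
```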

- building & explaining LSTM [Long Short-term Memory, Sepp Hochreiter; Jürgen Schmidhuber, Neural Computation 1997]

NB: the LSTM is a bit more complex than the formula above. It has two hidden streams, $c_t$ and $h_t$: the real memory stream is $c_t$ (named $h_t$ in our formula), while $h_t$ is an auxiliary stream used to compute $\alpha$ and the update ($g$ and $f$ above, respectively). Also, $\alpha$ and $1-\alpha$ are decoupled and not required to sum to 1.
- GRU: simplified version [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, NIPS 2014]
- simplifying further? if the update depends on the last input only: MinimalRNN [MinimalRNN: Toward More Interpretable and Trainable Recurrent Neural Networks, Minmin Chen, Symposium on Interpretable Machine Learning at NIPS 2017]

- design of the "forget gate" justified by continuous analysis, with additional recommendations on bias initialization (equivalent to typical memory duration) as $b \sim -\log\left( \mathrm{Uniform}\left[1, T_\max\right] \right)$ where $T_\max$ is the maximum expected memory duration [Can recurrent neural networks warp time? Corentin Tallec, Yann Ollivier, ICLR 2018]
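Sampling this recommended bias initialization (a sketch; $T_\max$ and the number of units are arbitrary, and the sign convention depends on how the gate is parameterized):

```python
import numpy as np

rng = np.random.default_rng(0)
T_max = 1000   # maximum expected memory duration (arbitrary value)
n = 128        # number of gate units (arbitrary value)

# b ~ -log(Uniform[1, T_max]): biases spread so that the characteristic
# memory durations of the gate units cover the whole range [1, T_max]
b = -np.log(rng.uniform(1.0, T_max, size=n))

print(b.min(), b.max())  # all values lie in [-log(T_max), 0]
```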

- for classification (ImageNet): e.g. LeNet, AlexNet, VGG... : general shape = pyramid : fine to coarse resolutions
- LeNet [eg: Handwritten digit recognition with a back-propagation network, Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, NIPS 1989]
- AlexNet [Imagenet classification with deep convolutional neural networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, NIPS 2012]
- VGG [Very Deep Convolutional Networks for Large-Scale Image Recognition, K. Simonyan, A. Zisserman, ICLR 2015]

- for image generation : general shape = reverse pyramid : coarse to fine resolutions
- e.g., one of the first realization with generative neural networks: [Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks, Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, NIPS 2015]
- example of the NVIDIA face generator (the old one): [Progressive Growing of GANs for Improved Quality, Stability, and Variation, Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, ICLR 2018]

- details should not be lost, while reasoning at a high level (object recognition) is required
- $ \implies $ fully-convolutional (i.e. only conv layers, no fully-connected ones) + auto-encoder shape (for high-level reasoning); yet fine details are lost:

- $ \implies $ keep detail information and re-use it when needed: fully-convolutional auto-encoder with "skip-connections" between encoding/decoding layers of same resolution = U-net [U-Net: Convolutional Networks for Biomedical Image Segmentation, Olaf Ronneberger, Philipp Fischer, Thomas Brox, MICCAI 2015] :

- avoid gradient vanishing/exploding : Glorot initialization (standard approach), or orthonormal weight matrices...
- chain of additions (ResNet) [Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, CVPR 2016]
- each block is of the form $h_{l+1} = h_l + g(h_l)$ instead of $h_{l+1} = g(h_l)$
- NB: each block = 2x conv-ReLU
- philosophy: each block adds to the previous ones, i.e. add fine corrections to what previous block did
- advantage: the output is of the form $f = \sum_l h_l$, so for any layer $l$, $\;\frac{d \mathrm{Loss}(f)}{d h_l} = \frac{\partial \mathrm{Loss}(f)}{\partial h_l} + \frac{\partial \mathrm{Loss}(f)}{\partial h_{l+1}} \frac{d h_{l+1}}{dh_l} + \dots $, so that there's a direct feedback to each layer, which helps optimization, especially at the beginning of the training.
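A minimal numpy sketch of such a residual block (linear layers stand in for the conv-ReLU pairs; zero-initializing the last layer, one common trick, makes the block start as the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # feature dimension (arbitrary)

def residual_block(h, W1, W2):
    """h_{l+1} = h_l + g(h_l), with g = two linear+ReLU layers
    (stand-ins for the two conv-ReLU layers of a real ResNet block)."""
    g = np.maximum(0.0, W2 @ np.maximum(0.0, W1 @ h))
    return h + g

h = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)) / np.sqrt(d)  # Glorot-style scaling
W2 = np.zeros((d, d))   # zero-init the last layer: block = identity at start

out = residual_block(h, W1, W2)
print(np.allclose(out, h))  # True: with g = 0 the block passes h through
```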

- or chain of blocks initialized to the identity (Highway networks; not as common in the literature)
- same as ResNet but with attention modules instead of plain additions (like the LSTM, but non-stationary): $\alpha\, \mathrm{old} + (1-\alpha)\, \mathrm{new}$ with adaptive $\alpha$
- [Highway Networks, R. K. Srivastava, K. Greff and J. Schmidhuber, ICML 2015] → [Highway and Residual Networks learn Unrolled Iterative Estimation, Klaus Greff, Rupesh K. Srivastava, Jürgen Schmidhuber, ICLR 2017]

- Training a standard CNN with 10 000 layers is possible according to [Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks, Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington, ICML 2018]
- using orthogonal matrices (for initialization) in such a way that no explosion/vanishing can happen
- yet validated on easy tasks only (MNIST, CIFAR-10); the authors notice performance saturation with depth, and argue that architecture design is what matters, not just depth.

- chain of blocks, each of which = several parallel blocks whose outputs are concatenated (later turned into a ResNet form)
- philosophy: hesitation in the design? shall we use a 3x3 or a 5x5 conv in this block? Try all possibilities in parallel, and let the training pick what is useful.
- auxiliary losses at intermediate blocks to help training (same idea: give better loss-backpropagation information to early layers as well)

- series of blocks, each of them = a feed-forward network of a few layers with dense connections (i.e. each layer receives information from all previous ones in its block)
- philosophy: not a plain straight-through chain of layers, but a mix of information at different levels; hesitation in the design? let the network pick the connections that are useful
- as the two extreme layers of each block are directly connected, the shortest backpropagation path to any layer is short (blocks are fast to go through)

- Video analysis/prediction
- auto-encoder + LSTM = conv-LSTM

- Aligning satellite RGB pictures with cadaster maps (binary masks indicating buildings, roads, etc.: one label per pixel), i.e. estimating the (spatial) non-rigid deformation between the two images
- full resolution output required (2D vector field), multi-scale processing $\implies$ very deep network to train

- possible goals:
- transistor: transfer one variable ($a$) or another one ($b$) depending on the context (switching variable $\alpha$)
- combine two variables $a$ and $b$ by summing them, but with weights $(\alpha, 1-\alpha)$ that depend on some context $c$
- examples of context: $a$ and $b$ themselves, or other features produced by the neurons/layers that produced $a$ and $b$

- how:
- given two real variables $a$ and $b$ (or two vectors of the same dimension)
- create a weighting variable $\alpha \in [0,1]$ (e.g. with a sigmoid) that depends on some context $c$
- return $\alpha a + (1-\alpha) b$ with $\alpha = \alpha(c)$

- → cf MinimalRNN/LSTM where it is used to control the amount of update/forget at each step

- Principle:
- we want to combine 'values' $V_i$, each of which is introduced with a 'key' $K_i \in \R^d$ (thought of as a descriptor of the context);
- the way these values are combined depends on the mood of the operator and on how much it likes the various 'keys';

i.e. given a 'query' $Q \in \R^d$, one will search for the most suitable 'keys', i.e. the $K_i$ most similar to $Q$, and combine the associated 'values' $V_i$ accordingly:

return $\sum_i w_i V_i$ with weights $w_i$ that somehow depend on the similarity between $Q$ and $K_i$, which will be quantified as the inner product $(Q \cdot K_i)$.

- example in NLP: translation of a sentence to a different language: at every step, each word has a certain number of features to describe it; how to incorporate information from other words? All words should not influence every other word, but influence should be selective (close semantic, or close grammatical relationship [subject/verb], etc.). Each word in turn will be the query $Q$, while all other words are the (value $V_i$, key $K_i$) pairs.

- attention block: function(query $Q$, keys $(K_i)$, values $(V_i)$):

Q $\mapsto \sum_i w_i V_i $

with $\sum_i w_i = 1$ : softmax of the similarities $(Q \cdot K_i)$

i.e. pick values of similar keys (similarity being defined as correlation in $\R^d$)

- more exactly: normalize $(Q \cdot K_i)$ by $\sqrt{d}$ before applying softmax, where d = length(query) = length(key), for better initialization/training, as $(Q \cdot K_i)$ is expected to be of the order of magnitude of $\sqrt{d}$ (same spirit as Xavier Glorot's initialization)
- NB: 'values' $V_i$ can be real values, but you can consider also vectors in $\R^p$...
- "multi-head" block: apply several attention modules (with different keys) and concatenate their outputs $\implies$ allows assembling different parts of the 'value' vectors
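A minimal single-head sketch of this attention block in numpy (toy random data; a real multi-head block would also include learned projections of $Q$, $K$ and $V$):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: weights = softmax(Q.K_i / sqrt(d))."""
    d = Q.shape[-1]
    scores = K @ Q / np.sqrt(d)      # similarity of the query with each key
    w = np.exp(scores - scores.max())
    w = w / w.sum()                  # softmax: weights are >= 0 and sum to 1
    return w @ V                     # weighted combination of the values

rng = np.random.default_rng(0)
d, n, p = 8, 5, 3                    # key dim, number of items, value dim
Q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, p))

out = attention(Q, K, V)
print(out.shape)  # (3,)
```

Since the weights form a convex combination, each output coordinate stays within the range of the corresponding value coordinates.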

- attention on features of a CNN / blocks in an Inception / features in a ResNet block / etc.
- principle: each block produces many features; let's focus on the features that seem to be important for our particular input image.
- for this: multiply all activities at the output of a block by a feature-dependent factor, in a way that depends on the current context (all block activities, summarized [i.e. averaged over all pixel locations]).

- papers: R-CNN, Mask R-CNN, Fast R-CNN, Faster R-CNN...
- detect zones of interest (rectangles), then, for each zone, rectify it, and apply a classification/segmentation tool on it

- [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, NIPS 2015]
- idem, but with an attention mechanism: the classification features have already been computed ('values'); the same features (pipeline) are used for detection and for classification/segmentation

- most known papers:
- [Memory networks, Jason Weston, Sumit Chopra, Antoine Bordes]
- [Neural Turing Machines, Alex Graves, Greg Wayne, Ivo Danihelka]
- Sparse Access Memory: [Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes, Jack W Rae, Jonathan J Hunt, Tim Harley, Ivo Danihelka, Andrew Senior, Greg Wayne, Alex Graves, Timothy P Lillicrap, NIPS 2016]

- attention mechanism to know when/where to write/read numbers in the memory

$\implies$ reading = softmax over memory values, with weights depending on the similarity with the queried "address"

- a graph with values on nodes (and/or edges) is given as input
- each layer computes new values for nodes / edges, as a function of node neighborhoods
- same function for all nodes (neighborhoods) : kind of "convolutional"
- new node value may depend on edge values also (kind of attention)
- idem for edge values (function of node values, edges...)
- stack as many layers as needed
- max-pooling: means coarsening the graph (deleting nodes)
- etc.
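A one-layer sketch of this node update in numpy (a simplified row-normalized variant of the Kipf & Welling rule; the graph and feature sizes are made up):

```python
import numpy as np

def graph_conv(A, H, W):
    """One graph-convolution layer: each node averages its neighborhood
    (itself included), then the same linear map + ReLU is applied everywhere."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalized averaging
    return np.maximum(0.0, A_norm @ H @ W)

# Toy graph: 4 nodes in a path 0-1-2-3, with 2 features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 2))   # node features
W = rng.standard_normal((2, 3))   # shared linear map, shape (in, out)

H1 = graph_conv(A, H, W)
print(H1.shape)  # (4, 3): new features for each node
```

Stacking such layers lets information propagate one hop further per layer, which is why several layers (or graph coarsening) are needed for long-range interactions.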

- Graph-conv-nets
- [Convolutional Networks on Graphs for Learning Molecular Fingerprints, David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gomez-Bombarelli, Timothy Hirzel, Alan Aspuru-Guzik, Ryan P. Adams, NIPS 2015]
- [Semi-Supervised Classification with Graph Convolutional Networks, Thomas N. Kipf, Max Welling, ICLR 2017]

$\implies$ spectral view

- With attention: [Graph Attention Networks, Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, ICLR 2018]
- use node features to compute similarity between nodes (instead of values on edges)
- multi-head attention for node value update

- first detect objects, then apply a network to these descriptions, for easier reasoning at the object (interaction) level.
- SHRDLU new age: [A simple neural network module for relational reasoning, Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap, NIPS 2017]

- Image segmentation: image input $\mapsto$ conv for object detection, then fully-connected between objects, then back to image with convnets

[Symbolic graph reasoning meets convolutions, X Liang, Z Hu, H Zhang, L Lin, EP Xing, NIPS 2018]

- principle: process pixels sequentially (1D ordering) instead of simultaneously
- define "context" of a pixel = all predictions already made for previous pixels
- re-define this in a multi-scale way
- concatenate all contexts and make prediction for the current pixel

→ stack of dilated causal convolution layers (+ residual connections and attention)

[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019]

→ stacked attention modules in an RNN (so, when a new input arrives, each of them is applied once, in series), with attention performed over the previous layer at all previous time steps.

NB: all-LSTM / all-conv / all-attention: each of these works (with proper design, initialization and training), and each claims to beat the other, previous methods (according to the respective papers)

→ e.g.: [Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, NIPS 2017]

Executing $n$ steps after each new input before reading the next one (when applying an RNN), with variable $n$ [Adaptive Computation Time for Recurrent Neural Networks, Alex Graves, 2016, unpublished]

- video analysis with several objects moving: one LSTM per object
- interaction between objects: add communication in the graph of LSTMs, only between nearest-neighboring objects at the current frame

- How come, when tackling a classification task, what we want is the best accuracy, but what we optimize is the cross-entropy (which is not the same criterion)? And why does it work?
- What matters is the optimization properties of the loss [cf http://cs231n.github.io/neural-networks-2/ ]

- many activation functions can do a relatively similar job
- but details and properties may vary
- example: max-pool → global average pooling → ranking/softmax (to use/train all regions)

$\implies$ auto-DL (Chapter 8)
