
# Chapter 2: Architectures


Overview:

I - Architectures as priors on function space, initializations as random nonlinear projections
II - Architectures

## I - Architectures as priors on function space, initializations as random nonlinear projections

• classical ML: design features by hand; deep learning: meta-design of features, i.e. design architectures that are likely to produce features similar to the ones you would have designed by hand
• an architecture = a family of functions, parameterized by the neural network's weights
• still a similar optimization problem (find the best function in the family), just in a much wider space (many more parameters)

### Architecture = prior on the function

• prior:
• as a constraint: what is expressible or not with this architecture?
• but most networks already have huge capacity (expressive power) $\implies$ expressivity is most often not the relevant criterion
• $\implies$ probabilistic prior: what is easy to reach or not?
• good architecture : with random weights, already good features (or not far) [random = according to some law, see later]
• in classical ML, lots of works on random features + SVM on top (usually from a kernel point of view); random projections
• in some cases at least (a-few-layer deep networks?), most of the performance is due to the architecture, not to the training: fitting just a linear classifier on top of the network with random weights doesn't decrease the accuracy much, compared to fully training the same network (see the sketch at the end of this subsection): [On Random Weights and Unsupervised Feature Learning, Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng, ICML 2011] (old: 5-layer LeNet architecture; to be confirmed for deeper architectures such as VGG)
• $\implies$ training the whole network vs. keeping random "non-linear projections" can be seen as a choice of optimization order or speed (train the last layer first)
• Extreme Learning Machines (ELM) = learn only the last layer (not proved to work on "very" deep architectures such as modern networks)
• NB: layers of networks with random weights may keep more information about the input than layers trained for classification (since not all of the information is useful for classification). Visualization of the quality of random weights in known architectures, by reconstructing the input from the k-th layer activations [VGG with random weights]: no big information loss: [A Powerful Generative Model Using Random Weights for the Deep Image Representation, Kun He, Yan Wang, John Hopcroft, NIPS 2016]
• To go further on information flow inside neural networks: see Information Bottleneck (chapter on Visualization)
• On deep architectures: learn a small part of the weights only : [Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing, Amir Rosenfeld, John K. Tsotsos]
• learn only a fraction of the features of each layer = sufficient to get good results (not very surprising)
• if one can learn only one layer, choose the middle one
• if you use random features, batch-norm is important (of course: the features should be "normalized" somehow at some point, so that the function doesn't explode or vanish; cf. next section)
• bias of the architecture (i.e. the probability distribution it induces over possible functions to learn): cf. the next section

### Random initialization: random but according to a chosen law that induces good functional properties

• avoid exploding or vanishing gradients (e.g.: activities globally multiplied by a constant factor $\gamma > 1$ or $\gamma < 1$ at each layer would yield a factor $\gamma^L \gg 1$ or $\ll 1$ after $L$ layers)
• Xavier Glorot initialization, or similar : [Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot, Yoshua Bengio, AISTATS 2010]
• multiplicative weights: uniform over $\left[- \frac{1}{\sqrt d}, + \frac{1}{\sqrt d} \right]$, or Gaussian $\mathcal{N}(0, \sigma^2 = \frac{1}{d})$ where $d$ = number of inputs of the neuron
• justification: if the weights $w_i$ are i.i.d. of mean 0 and standard deviation $\frac{1}{\sqrt{d}}$, and the inputs $x_i$ are i.i.d. of mean 0 and variance 1, the neuron computes $\sum_i w_i x_i$, which is still of mean $\E_x \E_w \sum_i w_i x_i = \E_w \sum_i w_i \times 0 = 0$ and of variance $\E_x \E_w \left(\sum_i w_i x_i\right)^2 = \sum_i \E_w w_i^2 \, \E_x x_i^2 = \sum_{i=1}^d \frac{1}{d} \times 1 = 1$ (the cross terms vanish since the $w_i$ are independent with mean 0). Thus, the output of the neuron follows the same law as its inputs (mean 0, variance 1). Consequently, if one stacks several such layers, the output of the network is still of mean 0 and variance 1.
• multiply by an activation-function-dependent factor if needed (e.g.: ReLU divides the variance by 2, since it is 0 over half of the input domain; so, correct by $\times \sqrt 2$); a numerical check is given at the end of this section
• biases: 0
• [Which neural net architectures give rise to exploding and vanishing gradients? NIPS 2018, Boris Hanin]
• previous paragraph: ensures that, at initialization, $f(x)$ is in a reasonable range (notations: $x$ = input, $f$ = function computed by the neural network)
• here: check also the Jacobian $\frac{df}{dx}$ at initialization
• turns out that, with $n_l$ = "number of neurons in layer $l$":
• $\mathrm{var}\left(\frac{df}{dx}\right)$ is fixed: $\frac{1}{n_0}$
• $\mathrm{var}\left( \left(\frac{df}{dx}\right)^2\right)$, or equivalently (since the second moment is fixed) the fourth moment $\E\left[ \left(\frac{df}{dx}\right)^4 \right]$, is lower- and upper-bounded by terms in $e^{\sum_{\mathrm{layers} \, l} \; \frac{1}{n_l}}$
$\implies$ $\sum \frac{1}{n_l}$ is an important quantity
$\implies$ avoid many thin layers; if on a neuron budget, choose equal size for all layers

### Designing architectures easy to train

• training "deep" networks : actually not really deep in terms of distance between any weight to tune and the loss information (in number of neurons to cross in the computational graph), for easier information communication (through the backpropagation)
• we saw that overparameterized networks (i.e. large layers) seem to be easier to train
• is depth required? Not an easy question. [Do Deep Nets Really Need to be Deep?, Lei Jimmy Ba, Rich Caruana, NIPS 2014]: training a shallow network directly doesn't work; training a deep one does; training a shallow one to mimic the deep one works, possibly with no more parameters than the deep one
• $\implies$ a "good architecture" is one that is "more likely to reach the right function"; better future optimizers might make smaller architectures more attractive

## II - Architectures

NB: this covers the currently most popular architectures; it is not to be taken as an exhaustive, immutable list.
$\implies$ "deep learning" is moving towards general "differentiable programming", with more flexible architectures/approaches every year: any computational graph is allowed, provided all operations are differentiable.

### Architecture zoo reminder (CNN, auto-encoder, LSTM, adversarial...)

CNN: reducing the number of parameters by sharing local filters + hierarchical model ($\implies$ invariance to translation, + much greater generalization power [cf. Chapter 1])
• needs to be interleaved with "zooming-out" operations (such as max-pooling) to get a wide enough receptive field (otherwise, it won't see the whole object)
• typical conv block: conv → ReLU → conv → ReLU → max-pool, with 3x3 convolutions or so (see the sketch below)
• NB: do not use large filters: better to rewrite a 15x15 filter as a hierarchical series of 3x3 filters: though the expressivity is similar, the induced probabilities on functions are different, e.g. the typical Fourier spectrum is different

Auto-encoder, VAE, GAN:

Recurrent networks

### Dealing with scale & resolution

Classification/generation of high-resolution images: pyramidal approaches (e.g. any conventional conv-network for ImageNet / reverse pyramid for image generation)

Same, but with full resolution for input and output simultaneously (example: image segmentation)

### Dealing with depth and mixing blocks

Training deep networks: not really deep in terms of distance to the loss information (except with orthonormal matrices...) (cf. section above)

Inception [v1: Going deeper with convolutions, C Szegedy et al, CVPR 2015]
• chain of blocks, each of which = parallel sub-blocks whose outputs are concatenated (later turned into a ResNet form); see the sketch below
• philosophy: hesitation in the design? shall we use a 3x3 or a 5x5 conv in this block? Well, try all possibilities in parallel, and let the training pick what is useful.
• auxiliary losses at intermediate blocks to help training (same idea: give better loss information, through backpropagation, to early layers as well)

DenseNet [Densely Connected Convolutional Networks, Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, CVPR 2017]
• series of blocks, each of them = feed-forward network of a few layers with dense connections between layers (i.e. each layer receives information from all previous ones in its block)
• philosophy: not a plain straightforward neural network, but a mix of information at different levels; hesitation in the design? let the network pick the connections that are useful (see the sketch below)
• as, in particular, the two extreme layers of each block are directly connected, the shortest backpropagation path to any layer is short (blocks are fast to go through)

### Examples (case study)

What architecture would you propose for:
• Video analysis/prediction
• auto-encoder + LSTM = conv-LSTM
• Aligning satellite RGB pictures with cadaster maps (binary masks indicating buildings, roads, etc.: one label per pixel), i.e. estimating the (spatial) non-rigid deformation between the two images
• full resolution output required (2D vector field), multi-scale processing $\implies$ very deep network to train

### Attention mechanisms, R-CNN

Basics of attention:
• possible goals:
• transistor: transfer one variable ($a$) or another one ($b$) depending on the context (switching variable $\alpha$)
• combine two variables $a$ and $b$ by summing them, but with weights $(\alpha, 1-\alpha)$ that depend on some context $c$
• examples of context: $a$ and $b$ themselves, or other features produced by the neurons/layers that produced $a$ and $b$
• how:
• given two real variables $a$ and $b$ (or, actually, vectors of the same dimension)
• create a weighting variable $\alpha \in [0,1]$ (e.g. with a softmax) that depends on some context $c$
• return $\alpha a + (1-\alpha) b$ with $\alpha = \alpha(c)$
• → cf. MinimalRNN/LSTM, where this is used to control the amount of update/forgetting at each step (a minimal sketch follows)

Case of many variables influencing each other [Attention is all you need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, NIPS 2017]
• Principle:
• want to combine 'values' $V_i$, each of which is introduced with a 'key' $K_i \in \R^d$ (thought of as a descriptor of the context);
• the way to combine these values depends on the mood of the operator and on how much it likes the various 'keys';
i.e. given a 'query' $Q \in \R^d$, one will search for the most suitable 'keys', i.e. the $K_i$ most similar to $Q$, and combine the associated 'values' $V_i$ accordingly:
return $\sum_i w_i V_i$ with weights $w_i$ that somehow depend on the similarity between $Q$ and $K_i$, which will be quantified as the inner product $(Q \cdot K_i)$.
• example in NLP: translation of a sentence to a different language: at every step, each word has a certain number of features to describe it; how to incorporate information from other words? All words should not influence every other word, but influence should be selective (close semantic, or close grammatical relationship [subject/verb], etc.). Each word in turn will be the query $Q$, while all other words are the (value $V_i$, key $K_i$) pairs.
• attention block: function(query $Q$, keys $(K_i)$, values $(V_i)$):
$Q \mapsto \sum_i w_i V_i$
with $\sum_i w_i = 1$: softmax of the similarities $(Q \cdot K_i)$
i.e. pick the values of similar keys (similarity being defined as the inner product in $\R^d$)
• more exactly: normalize $(Q \cdot K_i)$ by $\sqrt{d}$ before applying softmax, where d = length(query) = length(key), for better initialization/training, as $(Q \cdot K_i)$ is expected to be of the order of magnitude of $\sqrt{d}$ (same spirit as Xavier Glorot's initialization)
• NB: 'values' $V_i$ can be real values, but you can consider also vectors in $\R^p$...
• "multi-head" block : apply several attention modules (with different keys) and concatenate their outputs $\implies$ allow to assemble different parts of the 'value' vectors

Application of attention to features / blocks: [Squeeze-and-Excitation Networks, Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu, CVPR 2018, https://arxiv.org/abs/1709.01507]
• attention on features of a CNN / blocks in an Inception / features in a ResNet block / etc.
• principle: each block produces many features; let's focus on the features that seem to be important for our particular input image
• for this: multiply all activities at the output of a block by a feature-dependent factor that depends on the current context (all the block's activities, summarized, i.e. averaged over all pixel locations); a sketch follows

R-CNN: Region-CNN

### "Memory"

Not really working yet.
• best-known papers:
• attention mechanism to decide when/where to write/read numbers in the memory
$\implies$ reading = weighted sum of memory values, with softmax weights computed from the "address"

### GraphCNN

Principles:
• a graph with values on nodes (and/or edges) is given as input
• each layer computes new values for nodes / edges, as a function of node neighborhoods (a sketch of one such layer follows this list)
• same function for all nodes (neighborhoods) : kind of "convolutional"
• new node value may depend on edge values also (kind of attention)
• idem for edge values (function of node values, edges...)
• stack as many layers as needed
• max-pooling: means coarsening the graph (deleting nodes)
• etc.
Related literature:

Relation Networks

PixelRNN [Pixel Recurrent Neural Networks, Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, ICML 2016]
• principle: process pixels sequentially (1D ordering) instead of simultaneously
• define "context" of a pixel = all predictions already made for previous pixels
• re-define this in a multi-scale way
• concatenate all contexts and make prediction for the current pixel

Wavenet : to deal with time in a hierarchical manner [WaveNet: A Generative Model for Raw Audio, Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, 2016]
→ stack of dilated causal convolution layers (+ residual connections and attention-like gating); a sketch of the dilated causal stack follows

[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019]
→ stacked attention modules in an RNN (so: when a new input arrives, each of them is applied once, in series), with attention performed on the previous layer at all previous time steps.

NB: all-LSTM / all-conv / all-attention: each of them works (with proper design, initialization and training), and better than the other, previous methods (according to their respective papers)
→ e.g.: [Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, NIPS 2017]

Executing $n$ steps after each new input before reading the next one (when applying an RNN), with variable $n$ [Adaptive Computation Time for Recurrent Neural Networks, Alex Graves, 2016, unpublished]

Social LSTM [Social LSTM: Human Trajectory Prediction in Crowded Spaces, Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, Silvio Savarese, CVPR 2016]
• video analysis with several objects moving: one LSTM per object
• interaction between objects: add communication in the graph of LSTMs, only between nearest-neighboring objects at the current frame

### Other important pieces of design

Loss design:
• How is it that, when tackling a classification task, what we want is the best accuracy, but what we optimize is the cross-entropy (which is not the same criterion)? And why does it work?
• What is important is the optimization properties of the loss [cf http://cs231n.github.io/neural-networks-2/ ]

Activation functions:
• many activation functions can do a relatively similar job
• but details and properties may vary
• example: max-pool → global average pooling → ranking/softmax (to use/train all regions)

Layer size, feature size (number of neurons and/or features)

### Opening

Hyperparameter tuning (architecture + optimization parameters)
$\implies$ auto-DL (Chapter 8)
