$\newcommand{\E}{\mathbb{E}}$ $\newcommand{\R}{\mathbb{R}}$

Chapter 7: Small data, weak supervision, robustness / Modeling: incorporating priors or physics

NB: turn on JavaScript to get beautiful mathematical formulas, rendered by MathJax

$\newcommand{\epsi}{\varepsilon}$

Overview:

I - Small data, weak supervision, robustness II - Modeling: incorporating priors or physics

I - Small data, weak supervision, robustness

Data augmentation, synthetic data, invariances

→ Edouard's slides
related exercise

II - Modeling: incorporating priors or physics

This section is about modeling the task: exploiting known invariances, priors or physical properties

[Curriculum learning; Y Bengio, J Louradour, R Collobert, J Weston; ICML 2009]
• Main idea: not harder "tasks", but harder "examples": pick gradually harder and harder examples, i.e. make the probability of picking hard examples increase with time
• Formalization: instead of training directly on the distribution $p$ of all examples, train on a distribution $q_t$ varying with time and including gradually harder examples more frequently:
• at time $t \in [0,1]$, consider the distribution $q_t(z) = w_t(z) p(z)$ over all examples $z$
• with $w_t$ increasing with $t$, for each $z$ (i.e., starting from few examples and adding some with time)
• and entropy $H(q_t)$ increasing with $t$ (i.e., more and more diversity)
• and $w_1(z) = 1$ for all $z$, i.e. the final distribution $q_1(z)$ is the target $p(z)$
• NB: notion of "easy"/"hard" example = defined by hand by the user
• Experiments:
• classifying geometrical shapes (triangles, rectangles...) from easy shapes to more complex ones
• NLP task with growing vocabulary
$\implies$ notice faster & better convergence
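As a minimal illustration of the formalization above (the difficulty score is hand-defined and hypothetical, as is the threshold rule for $w_t$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
difficulty = rng.random(n)      # "easy"/"hard" score per example, defined by hand
p = np.full(n, 1.0 / n)         # target distribution p over all examples (uniform here)

def q_t(t):
    """Curriculum distribution q_t(z) = w_t(z) p(z) (renormalized): only examples
    easier than the current threshold t are sampled; at t = 1, w_1(z) = 1 so q_1 = p."""
    w = (difficulty <= t).astype(float)
    q = w * p
    return q / q.sum()

def entropy(q):
    nz = q[q > 0]
    return -(nz * np.log(nz)).sum()

# more and more diversity: H(q_t) increases with t, and q_1 recovers the target p
assert entropy(q_t(0.3)) < entropy(q_t(0.7)) < entropy(q_t(1.0))
assert np.allclose(q_t(1.0), p)

# sampling a training example at curriculum time t:
z = rng.choice(n, p=q_t(0.5))
```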

Example of invariance enforcement by design: permutation invariance

Deep Sets

[Deep Sets; Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, Alexander Smola; NIPS 2017]

Examples of applications: when dealing with unordered sets of points

• PointNet / PointNet++ : 3D point cloud classification / segmentation (e.g., shapes obtained by laser measurements)
• mini-batch GAN (see above)
• deep genetics
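Permutation invariance by design can be sketched in a few lines: any function of the form $\rho\left(\sum_i \phi(x_i)\right)$ (the Deep Sets architecture) gives the same output whatever the ordering of the points. Below, $\phi$ and $\rho$ are stand-in one-layer maps with random weights, just to check the invariance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Deep Sets model f(X) = rho(sum_i phi(x_i)); the sum pooling makes the output
# invariant to the order of the points. Weights are random, for illustration only.
W_phi = rng.standard_normal((3, 16))    # phi: applied to each point independently
W_rho = rng.standard_normal((16, 1))    # rho: applied to the pooled representation

def deep_set(X):                         # X: (n_points, 3), an unordered set
    h = np.maximum(X @ W_phi, 0.0)       # phi (one layer + ReLU), per point
    pooled = h.sum(axis=0)               # permutation-invariant pooling
    return (pooled @ W_rho).item()       # rho

X = rng.standard_normal((10, 3))
perm = rng.permutation(10)
assert np.isclose(deep_set(X), deep_set(X[perm]))   # same output, any ordering
```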

Choosing physically meaningful metrics

Optimal transport (Sinkhorn approximation)

• [Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances; Marco Cuturi; NIPS 2013]
and in primal form: [Learning Generative Models with Sinkhorn Divergences; Aude Genevay, Gabriel Peyré, Marco Cuturi; AISTATS 2018]
• Goal: to compare output distribution $p$ with desired (ground truth) one $q$
• Optimal transport: a distance between distributions, quantifying how much work is needed to transform one distribution into the other. Analogy: moving earth with a shovel, minimizing the energy required, taking into account both the mass and the distance over which it is transported.
$\implies$ a better distance between distributions than the $L_2$ norm... or than $KL$ (the Kullback-Leibler divergence doesn't handle geometry: it compares each bin with its exact counterpart only, with no notion of neighborhood or distance between bins)
• Optimal transport formulation: $$\inf_{\text{transport fields} \;w\; \text{matching} \; p\; \text{and}\; q} \int_{x,y} w(x,y) C(x,y) dx dy$$ where $C(x,y)$ is the cost of moving some unit mass from point $x$ to point $y$,
where $w(x,y)$ is the quantity of mass transported from point $x$ to point $y$,
and where the conditions on the transport $w$ write:
• reach the target distribution: $\forall y, \;\; \int_x w(x,y) dx = q(y)$
• start from the candidate distribution: $\forall x,\;\; \int_y w(x,y) dy = p(x)$
then return the total transportation cost

• Issue: closed form solution in 1D space, but too expensive to compute in higher dimensional space $\implies$ approximation schemes; in particular, approximation by Sinkhorn divergences
• in both dual/primal formulations: if the distributions are sets of points with identical masses (Dirac peaks), then solving the optimal transport problem is equivalent to searching for a bistochastic (*) transport matrix $W$ minimizing the total cost $\sum_{ij} W_{ij} C_{ij}$ (the sum of the entries of $W .* C$)
(*) bistochastic = all rows and columns sum up to 1
• approximation: add a regularization term $-\,\epsi\; \text{entropy}(W)$ to the cost, making the problem strictly convex (unique solution)
• the solution $W$ is then necessarily of the form $\text{diag}(u)\, K\, \text{diag}(v)$, with $K = e^{-C/\epsi}$, and exactly one matrix of this form satisfies the marginal constraints
• Sinkhorn-Knopp algorithm [1967]: $W$ is obtainable by the iterative algorithm (proved to converge): $$u \;\leftarrow\; \mathbf{1} \;./\; (K v)$$ $$v \;\leftarrow\; \mathbf{1} \;./\; (K^\top u)$$ (for the bistochastic case; in general, divide the marginals $p$ and $q$ by $Kv$ and $K^\top u$ respectively), starting from $u = v = (1\, 1\, 1\, 1\dots)$, and at convergence return: $$u^\top\, (K \; .* \; C) \; v \;=\; \sum_{ij} u_i\, K_{ij}\, C_{ij}\, v_j$$ as $u_i K_{ij} v_j$ stands for the transported mass $w(x,y)$ and $C_{ij}$ stands for the transport cost $C(x,y)$.
• this is implementable with a (fixed in advance) number of simple neural network layers, as all operations are differentiable

• thus the (approximated) optimal transport cost $\text{OT}(p,q)$ can be used in an optimization process such as backpropagation: $\displaystyle\;\; \min_f \sum_{\text{examples}\; k} \text{OT}( f(\text{ex}_k), \text{target} \; q_k )$.
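A minimal numpy sketch of the Sinkhorn-Knopp iterations above, with uniform marginals $p$, $q$ and a hypothetical cost matrix built from two small point clouds (sizes and $\epsi$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# two point clouds; C[i, j] = squared distance = cost of moving unit mass i -> j
x, y = rng.uniform(0, 1, (5, 2)), rng.uniform(0, 1, (5, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

eps = 0.5                            # entropic regularization strength
K = np.exp(-C / eps)
p = np.full(5, 1 / 5)                # source marginal (identical masses)
q = np.full(5, 1 / 5)                # target marginal
u, v = np.ones(5), np.ones(5)

for _ in range(1000):                # Sinkhorn-Knopp fixed-point iterations
    u = p / (K @ v)
    v = q / (K.T @ u)

W = np.diag(u) @ K @ np.diag(v)      # transport plan, of the form diag(u) K diag(v)
ot_cost = (W * C).sum()              # approximate optimal transport cost

assert np.allclose(W.sum(axis=1), p) and np.allclose(W.sum(axis=0), q)
```

All operations here are differentiable, which is why the same fixed number of iterations can be unrolled as network layers and backpropagated through.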

Metric learning: Siamese networks

[Signature Verification using a "Siamese" Time Delay Neural Network; Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger and Roopak Shah; NIPS 1994]
• estimate the best metric (e.g. for some classification task), i.e. for instance, a mapping to another space, in which the Euclidean distance will be relevant
• if desired distances (between given points) or clusters are known, define a loss on triplets $(x,p,n)$, such as $$\sum_{(x,p,n)} \max(0,\; d(x,p) - d(x,n) + m)$$ where
• $x$ : current sample,
• $p$ : positive example : $d(x,p)$ should be small
• $n$ : negative example : $d(x,n)$ should be large
• $m$ : arbitrary margin
The optimization of such a loss will push together positive samples with $x$ and push away negative ones from $x$.
NB : without the margin and the clamping at 0, the optimization would provably diverge (pushing negatives infinitely far always decreases the loss).
• named Siamese because the network $\varphi$ needs to be applied twice (e.g., to $x$ and $p$) in order to compute the (Euclidean) distance in the output space (between $\varphi(x)$ and $\varphi(p)$).
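The triplet loss above, written as a function (with a placeholder embedding $\varphi$, purely illustrative):

```python
import numpy as np

def triplet_loss(phi, x, p, n, m=1.0):
    """Hinge loss max(0, d(x,p) - d(x,n) + m), with d = Euclidean distance
    between embeddings phi(.)."""
    d_xp = np.linalg.norm(phi(x) - phi(p))
    d_xn = np.linalg.norm(phi(x) - phi(n))
    return max(0.0, d_xp - d_xn + m)

phi = lambda v: 2.0 * v     # placeholder for the learned embedding network
x = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])    # positive example, already close to x
n = np.array([3.0, 0.0])    # negative example, already far from x

assert triplet_loss(phi, x, p, n) == 0.0                      # margin satisfied
assert triplet_loss(phi, x, p, np.array([0.2, 0.0])) > 0.0    # too-close negative
```

A non-zero loss only occurs when the negative is not at least $m$ farther than the positive, which is exactly what drives the push/pull behavior described above.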

Noisy data (can be seen as noise modeling)

Denoising auto-encoder

[Extracting and composing robust features with denoising autoencoders; P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol; ICML 2008]
• originally meant to robustify auto-encoders (without requiring a bottleneck of few middle neurons)
• get noisy inputs (in the article, artificial additional noise), ask for reconstruction without noise
• learns to get robust features and to denoise
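A minimal denoising auto-encoder sketch: a linear map trained by gradient descent (no bottleneck needed, as noted above; the noise level and sizes are arbitrary). The key point is that the input is corrupted but the reconstruction target is the clean signal:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 8))            # clean training inputs

W = rng.standard_normal((8, 8)) * 0.1        # linear "auto-encoder" (no bottleneck)
lr = 0.01
for _ in range(2000):
    noisy = X + 0.3 * rng.standard_normal(X.shape)   # corrupt the input...
    recon = noisy @ W
    grad = 2 * noisy.T @ (recon - X) / len(X)        # ...but reconstruct the CLEAN X
    W -= lr * grad

# the learned map shrinks the noise: reconstruction from a fresh noisy input
# is closer to the clean signal than the noisy input itself
test_noisy = X + 0.3 * rng.standard_normal(X.shape)
assert np.linalg.norm(test_noisy @ W - X) < np.linalg.norm(test_noisy - X)
```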

What if noise is already present in the dataset, but unknown? (no denoised target available)

• data self-realignment: is it possible?
• example: registering RGB satellite images to cadaster maps (binary pictures indicating building presence): no exact ground truth available (spurious deformations in data acquisition & elevation landscape / human mistakes in making cadasters)
• in the case of a regression task : $x \mapsto y$ with noisy dataset samples $(x,y)$ :
• if output noise (on $y$) is ~i.i.d. (does not depend on x)
or more exactly: if "relatively similar" $x, x'$ do not share more similar noise on $y, y'$
then no bias in the dataset
• if loss = $L_2$ metric : $\sum_i \| \hat{y}(x_i) - y_i \|^2$ with $y_i = y_i^{\text{real}} + \text{noise}_i$ (unknown noise)
then, with the following assumptions:
• no noise bias (condition above on similar $x$)
• no overfit
• dense sampling limit (each sample has many neighbors)
the optimal $\hat{y}(x_i)$ is reached at $y_i^{\text{real}}$!
• in practice: denoise the dataset this way (here, re-align) and then re-learn from the aligned dataset (as less noise $\implies$ better results [theoretically same global optimum, but with different confidences (= variance)])
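A quick numerical check of the claim, under the stated assumptions (i.i.d. zero-mean output noise, a smooth non-overfitting regressor, dense sampling; the sine ground truth and polynomial regressor are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# dense sampling: many nearby x's, targets corrupted by i.i.d. zero-mean noise
x = rng.uniform(0, 1, 5000)
y_real = np.sin(2 * np.pi * x)
y = y_real + 0.5 * rng.standard_normal(5000)   # noisy dataset targets

# a smooth, non-overfitting regressor trained with the L2 loss
coef = np.polyfit(x, y, deg=7)
y_hat = np.polyval(coef, x)

# the L2-optimal prediction lands near y_real, not near the noisy targets y
assert np.abs(y_hat - y_real).mean() < np.abs(y - y_real).mean()
```

The fit averages out the noise of the many neighbors of each sample, which is exactly the "dense sampling limit" argument above.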

Incorporating physical knowledge: Example of learning dynamics equations

Example: weather forecast, with training data

Range of possibilities: from using a pre-existing model to learning everything from scratch

Data assimilation

Equation known, just a few coefficients to estimate, or conditions to re-estimate regularly (e.g., if chaotic behavior)
$\implies$ statistical estimates, Kalman filter

Learning a Partial Differential Equation (real equation not known)

• genetic optimization (of possible terms to incorporate in the equation)
• a PDE is an iterative process $\implies$ model it as a recurrent network
• going further: not using a RNN, but using an Ordinary Differential Equation (ODE) solver $\newcommand{\dd}{\partial}$
[Neural Ordinary Differential Equations; Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud; best paper at NeurIPS 2018]
• given $f_\theta$, which describes the dynamics by $\frac{dh}{dt} = f_\theta\left(h_t\right)$, and given an initial condition $h_{t_0}$, use an ODE solver to compute $h_t$ at any real continuous time $t > t_0$
$\implies$ more precise than a RNN with discretized time steps
• the dynamics might depend also on $t$, using $\frac{dh}{dt} = f_{\theta, t} \left(h_t\right)$, for instance using a neural network $f_\theta$ with two inputs: $t$ and $h_t$
• given a loss to be minimized $L(h_{T})$ with respect to the dynamics $f_{\theta,t}$, one can similarly use an ODE solver to compute $\nabla_\theta L(h_T)$:
• define the "adjoint" $a(t) = \frac{\dd L}{\dd h_t}$, which tells how the state $h_t$ should change at time $t$ to decrease the loss
• using the chain rule (as for backpropagation), this adjoint satisfies: $$\frac{d a(t)}{dt} = - a(t)^\top \, \frac{\dd f_{\theta,t}( h_t )}{\dd h_t}$$
• and then the total variation with respect to the parameters $\theta$ can be expressed by integration: $$\frac{dL}{d\theta} = - \int_{t_1}^{t_0} a(t)^\top\,\frac{\dd f_{\theta,t}( h_t )}{\dd\theta}\,dt$$
• this integral can be computed using an ODE solver...
• thus one can learn the dynamics $f_{\theta,t}$
• example: complete a temporal series $(x_0, x_1, x_2 \dots x_N)$:
• consider that the dynamics learned / to be learned do not act directly on $x_t$ but on an internal state $h_t$ (in order to be able to deal with partial or noisy information, etc.)
• so first, transform this sequence into a guessed initial internal state $\hat{h}_{t=0}$ (named $z_0$ in the figure below), using an RNN (going reverse in time, from $x_N$ to $x_0$; the RNN has to be learned as well)
• then apply the dynamics $f_{\theta,t}$ previously learned, to $\hat{h}_0$ (using an ODE solver)
• this gives guessed states $\hat{h}_t$ for all times, in particular $\hat{h}_{t>N}$
• from them, deduce missing observations $\hat{x}_{t>N}$
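The forward pass above can be sketched with a minimal fixed-step ODE solver (real implementations use adaptive solvers such as Dormand-Prince; the linear dynamics $f_\theta(h) = \theta h$ is a toy example with a known closed-form solution, used only to check the integrator):

```python
import numpy as np

def odeint_euler(f, h0, t0, t1, n_steps=1000):
    """Minimal fixed-step ODE solver: integrates dh/dt = f(t, h) from t0 to t1."""
    h, t = np.array(h0, dtype=float), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * f(t, h)       # forward Euler step
        t += dt
    return h

# toy dynamics f_theta(h) = theta * h, whose exact solution is h0 * exp(theta * t)
theta = -0.5
h1 = odeint_euler(lambda t, h: theta * h, h0=[1.0], t0=0.0, t1=2.0)
assert np.allclose(h1, np.exp(theta * 2.0), atol=1e-3)
```

The solver evaluates $h_t$ at any real continuous time, rather than at the fixed discrete steps of an RNN; the adjoint trick above computes gradients by integrating a second ODE backwards in time with the same kind of solver.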

• choosing relevant sensors and mixing information

Incorporating invariances/symmetries of the problem

• e.g.: translation / rotation / permutation... or symmetry between variables (shared blocks of networks)

Knowing an equation that the solution has to satisfy : solving PDEs!

[DGM: A deep learning algorithm for solving partial differential equations; Justin Sirignano and Konstantinos Spiliopoulos; Journal of Computational Physics 2018]
• equation to solve, of the form:  $\dd_t\, u(t,x) + L\, u(t,x) = 0 \;\;\;\; \forall (t,x) \in [0,T]\times\Omega$ with constraints $u(0,x) = u_0(x) \;\;\;\; \forall x \in \Omega\;\;\;$ [initial condition] and $u(t,x) = g(t,x) \;\;\;\; \forall (t,x) \in [0,T] \times \dd\Omega\;\;\;$ [boundary condition: control]

where:
• one searches for a function $u$ defined over the spatio-temporal domain $[0,T]\times\Omega$
• $L$ is a differential operator, not necessarily linear, applied to function $u$
• $\dd_x$ denotes the partial derivative with respect to $x$
• $\dd\Omega$ denotes the boundary of domain $\Omega$

• minimize $J(f) = \| \dd_t f + L f\|^2_{[0,T]\times\Omega} + \|f-g\|^2_{[0,T]\times\dd\Omega} + \|f(0,.) - u_0\|^2_\Omega$
• operators such as $\dd_t$ or $L$ are easy to implement on top of the network providing $f$
• proved (under assumptions) and checked experimentally to converge to 0 (and $f \to$ solutions)
• in practice: replace integral with Monte Carlo estimates (i.e. $\E_{\text{batch samples}}$ ...) and replace second-order derivatives $\dd_{x_1} \dd_{x_2} f(t,x)$ with approximations of type $\frac{ \dd_x f(t,x+\epsi) - \dd_x f(t,x) }{\epsi}$.
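A sketch of this Monte Carlo + finite-difference estimate of the residual $\| \dd_t f + L f\|^2$, on the 1D heat equation $\dd_t u = \dd^2_{xx} u$ (so $L u = -\dd^2_{xx} u$; a hypothetical test case, not the paper's experiments). An exact solution should give a near-zero residual, any other candidate a large one:

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 1e-4

def residual(f, n=1000):
    """Monte Carlo estimate of || d_t f - d_xx f ||^2 over random (t, x) samples,
    with derivatives replaced by finite differences as described above."""
    t = rng.uniform(0.1, 1.0, n)
    x = rng.uniform(0.0, 1.0, n)
    d_t = (f(t + eps, x) - f(t, x)) / eps
    d_x = lambda t_, x_: (f(t_, x_ + eps) - f(t_, x_)) / eps
    d_xx = (d_x(t, x + eps) - d_x(t, x)) / eps
    return np.mean((d_t - d_xx) ** 2)

u_exact = lambda t, x: np.exp(-np.pi ** 2 * t) * np.sin(np.pi * x)  # solves the PDE
u_wrong = lambda t, x: t * x                                         # does not
assert residual(u_exact) < 1e-3 < residual(u_wrong)
```

In the DGM setting, `f` would be a neural network and this residual (plus the boundary and initial-condition terms) would be minimized by backpropagation.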

Important example in chemistry

[Machine Learning of coarse-grained Molecular Dynamics Force Fields; Jiang Wang, Christoph Wehmeyer, Frank Noé, Cecilia Clementi]
• Finding or approximating quantum mechanics equations' solutions is a very important topic in chemistry, as these equations are hard to solve (very heavy computations) and molecule properties depend on them
• (energy) potential function $U(x)$ around a molecule $x$: hard to model / compute
• with ML : train $x \mapsto U(x)$ from simulations
• desired physical property: $- \dd_x U(x) = \text{ForceField}(x)$
• minimize $\| \dd_x U(x, \theta) + \text{ForceField}(x)\|^2$ w.r.t. $\theta$ (parameters of the neural network standing for $U$)
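A toy force-matching sketch of this objective (hypothetical harmonic force field, one-parameter potential, grid search instead of backpropagation, only to show the structure of the loss):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 500)                 # configurations from simulation
force = -4.0 * x                            # hypothetical force field data

# parametric potential U(x; theta) = theta * x^2 / 2, so d_x U = theta * x;
# force matching: minimize || d_x U(x, theta) + ForceField(x) ||^2 over theta
def loss(theta):
    return np.mean((theta * x + force) ** 2)

thetas = np.linspace(0.0, 10.0, 1001)
theta_hat = thetas[np.argmin([loss(t) for t in thetas])]
assert abs(theta_hat - 4.0) < 0.02          # recovers U(x) = 2 x^2 (force -4x)
```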

Form of the solution known

[Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge; Emmanuel de Bezenac, Arthur Pajot, Patrick Gallinari; 2017]
• goal: prediction of the evolution of temperature maps (sea surface temperature, from satellite imagery)
• equation satisfied by the fluid (the ocean streams): advection-diffusion: $$\dd_t I + w \cdot \dd_x I = D\, \dd^2_{xx} I$$ ($\dd^2_{xx}$ = Laplacian, $D$ = diffusion constant, $w$ = unknown motion field)
• theorem: the solution of this equation is of the form: $$I(x,t) = \int_{\R^2} \frac{1}{4\pi D\,t} e^{- \frac{\|x-y-w\|^2 }{4Dt}} I_0(y) dy$$
• instead of predicting directly the new state at $t+1$ by learning $I(.,t) \mapsto I(.,t+1)$ directly, learn to predict the motion: $I(.,t) \mapsto w(.,t)$ from which one can estimate $I(.,t+1)$
• criterion to optimize: good prediction of $I(t+1)$ + regularization of $w$ (smooth, small, small divergence)

$\implies$ better results than direct prediction; this said, the same team, in a later paper, without using the solution form, but using an ODE solver, obtained better results [Learning Dynamical Systems from Partial Observations; Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, Julien Brajard, Patrick Gallinari; 2019]

In practice

Learn from real samples (usually few) or from simulations (goal: improve simulation speed, or learn where to focus for good accuracy...)

Deep learning for physical dynamics : learning and controlling the dynamics

Consider a dynamical system:
• $x_{t+1} = F(x_t)$ with fixed, unknown function $F$, and $x_t \in \R^d$
• if $F$ is linear: "easy" (or, at least, a classic problem) to retrieve from data samples $(x_t, x_{t+1}, \dots)$ as a least-squares problem $\displaystyle \min_A \sum_{\text{samples} \; x} \|x_{t+1} - A\, x_t\|^2$
• solution: note $X = [x_1, x_2, ... x_T]$
and $Y = [x_2, x_3, ... x_{T+1}]$
then the matrix $Y\, X^\dagger \to$ optimal $A$ when $T$ increases
where $X^{\dagger}$ is the Moore-Penrose pseudo-inverse of $X$, which is $\displaystyle \;\;\lim_{\epsi\to 0}\;\; (X^* X + \epsi\, \mathrm{Id})^{-1} X^*$
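The least-squares recovery of $A$ via the pseudo-inverse, in numpy (synthetic snapshot pairs with a small noise, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
A_true = rng.standard_normal((3, 3)) * 0.3   # unknown linear dynamics x_{t+1} = A x_t

# snapshot matrices: columns of X are states x_t, columns of Y the successors x_{t+1}
X = rng.standard_normal((3, 200))
Y = A_true @ X + 0.01 * rng.standard_normal((3, 200))   # small observation noise

A_hat = Y @ np.linalg.pinv(X)                # Y X^dagger -> optimal A
assert np.linalg.norm(A_hat - A_true) < 0.01
```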
• what if $F$ non linear?

Koopman - von Neumann classical mechanics using a quantum mechanics formalism:
• Theorem: there exists a representation space in which any dynamical system becomes linear
$\implies$ approximate this representation with a neural network! and learn linear dynamics there
• unfortunately: this representation space = a complex Hilbert space (i.e., an infinite-dimensional vector space)

[Deep Dynamical Modeling and Control of Unsteady Fluid Flows; Jeremy Morton, Freddie D. Witherden, Antony Jameson, Mykel J. Kochenderfer; NIPS 2018]
• train a network $g$ mapping the snapshots to another space, $X \mapsto X'$ and $Y \mapsto Y'$,
such that the dynamics in that space are linear, i.e. $Y' = A' X'$ for some matrix $A'$,
together with the inverse network $h = g^{-1}$ mapping back: $X' \mapsto X$ and $Y' \mapsto Y$.
Then, for a new $X$, predict $h\big( A'\, g(X) \big)$ [supposed to be close to $Y$]

• loss: $\|X - h(g(X))\|^2$ : reconstruction error of the auto-encoder $h \circ g$
$+\; \|Y - h(A'(g(X)))\|^2$ : prediction error
• NB: given $X'$ and $Y'$, the optimal matrix $A'$ (best linear dynamical model) can be computed in closed form as above for the linear case. The non-linearity of the dynamics in that space can be measured by how much $A'X'$ differs from $Y'$ for that best $A'$.
• idem for control
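A closed-form toy illustration of the structure of this loss: for the dynamics $x_{t+1} = x_t^2$ on $(0,1)$, the map $g = \ln$ linearizes the dynamics exactly (since $\ln x^2 = 2 \ln x$), so both loss terms vanish. In the paper $g$, $h$ and $A'$ are learned neural networks; here they are written in closed form just to exhibit the two terms:

```python
import numpy as np

rng = np.random.default_rng(8)

g = lambda Z: np.log(Z)                      # lift to the "linear dynamics" space
h = lambda Z: np.exp(Z)                      # inverse map back (h = g^{-1})

X = rng.uniform(0.1, 0.9, (1, 50))           # snapshots x_t (columns)
Y = X ** 2                                   # successors x_{t+1} = x_t^2 (nonlinear)

Xp, Yp = g(X), g(Y)
A_prime = Yp @ np.linalg.pinv(Xp)            # best linear model in the lifted space

recon_loss = np.sum((X - h(g(X))) ** 2)              # auto-encoder term ||X - h(g(X))||^2
pred_loss = np.sum((Y - h(A_prime @ g(X))) ** 2)     # prediction term ||Y - h(A' g(X))||^2
assert recon_loss < 1e-12 and pred_loss < 1e-12      # exact linearization here
```

Here $A'$ comes out as the scalar 2, computed in closed form from $X'$ and $Y'$ as in the linear case above; with learned $g$ and $h$, the two loss terms are only driven towards zero by training.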

Cf also Steven Brunton's lab page, e.g.:
Active topic:
cf workshop : Machine Learning for Computational Fluid and Solid Dynamics

Back to the main page of the course