Most exercises so far have been expessed as classical machine learning problems: given a task to perform and a dataset to learn from, find a suitable neural network architecture and train it to solve the task.

TP5 follows the same template. The task is to imitate a given dynamical system, namely, Rössler attractor. Slide 9 describes the possible approaches to define this task, and the goal of TP5 is to choose and solve one of these tasks.

1) First approach : full state known (second approach in slide 9 [bottom left])
====================================

1A) Dynamical system setup
--------------------------

We have a dynamical system of the form w_{t+1} = f(w_t)
where w = (x,y,z) are 3D coordinates
and where f is a function (from R^3 to R^3) which does not depend on t.
Given a location w_t at time t, this tells where the new location at t+1: w_{t+1} = f(w_t).
The goal is then to find this function f (i.e., to infer it from data samples).
The function is actually quite simple, so a simple classical feedfoward network (with inputs in R^3 and outputs in R^3) should be able to do the job.
To train the network, one has access to as many samples (w, f(w)) as one wants to, taken wherever one wants (i.e. you can choose the values w), by calling a simulator (namely, RosslerMap). In a real setting, one would not learn from a simulator, but from experiments.
One can then compare the trajectories obtained by the true Rössler attractor (RosslerMap.full_traj) with the ones generated by the network, i.e. (w, f(w), f(f(w)), f(f(f(w))), ...).
The errors made by the network will cumulate over time, i.e. even if f(w) is close to the true value, f(f(f(...(w)...))) might get quite far from the real value after a certain number of time steps.

More exactly, going from t to t+1 is too big a step (leading to a too difficult problem: where will the point be after 1 second?), so one might prefer to choose a smaller time step dt (for instance 0.001) and learn the function w_t --> w_{t+dt}.
Conveniently, one can provide the desired dt to the simulator, to build training data with the right time steps.
You will need to specify which value for dt you have chosen, so that when we generate trajectories with your code, we can deduce how many steps to perform to reach t = 10 from t = 0, for instance.


1B) Rössler attractor properties
--------------------------------

The true model (Rössler attractor) is actually chaotic, which means that any small change in the initial point w+dw will lead to a completely different trajectory after some time. The model learned by the neural network will thus not be able to be perfect, and we know at which speed the errors are amplified (on average when following a long trajectory): this is named the "Lyapunov exponent": for two very close initial points w_0 and w'_0, ||w_t - w'_t|| ~ exp(lambda t) ||w_0 - w'_0|| (in the asymptotic limit of high t).
It will not be possible to make predictions whose errors grow more slowly than exp(lambda t); the goal is rather to get as close as possible to this rate.

Rössler attractor is also ergodic, which means that any sufficiently long trajectory will go through all the possible states w of the system. Thus, learning from any single very long trajectory is theoretically similar to learning from many shorter trajectories taken at random locations ("random" not as in "uniform over R^3" but according to the probability of the dynamical system to get there, which is not given [but might be estimated]).


1C) Estimating the quality of the dynamical system learned, by statistics
-------------------------------------------------------------------------

When learning to imitate the dynamical system, one is not interested in just makeing sure that the prediction f(w) for any w is close to the true value, but rather that the dynamics on the long term (i.e., the trajectories) do look correct. To check that the dynamics are correct, one can look at the following quantities:
- histograms ("PDF": probability density function), as a function of x, y and/or z. That is, for your model and for the true one, are the probabilities to reach any location similar enough? 
- time correlations. Pick a relevant duration T and check the joint law of (w_t, w_{t+T}), for instance along a long trajectory. Does it look similar for the two models?
- move to the Fourier domain, which gives information about frequencies (which is relevant as the dynamical system follows globally a loop, though not exactly periodic). For a given long trajectory, are the frequencies observed similar for both models?
Three such quantities should be computed to get the marks for that question ("70%" in Slide 7); this list is not exhaustive, and other quantities relevant to the problem could be thought of instead.


1D) Estimating the quality (bis), with physical quantities
----------------------------------------------------------

Instead of just looking at global statistics over trajectories, one can wonder whether some key quantities with physical meaning are similar. Among possible physical quantities, for a dynamical system:
- equilibrium points, that is, points where the speed is 0 (i.e. f(w) = 0). In the case of the Rössler attractor, there is only one such point (and it is an unstable equilibrium: the center of the pseudo-circles). Is it at the right place?
- the Lyapunov exponents, which express the chaoticity of the system. Here, we are interested in the largest one, which is about 0.07.

In Slide 7 it is told that computing 2 such relevant quantities is worth 30% of the marks.
NB: the code to find the equilibrium point and the Lyapunov exponent is given in TP.py. However this code makes use of the Jacobian 'A' of the continuous case "dw/dt = g(w_t)" (cf Slide 8), so the code needs to be adapted to be used in this discrete time case "w_{t+dt} = f(w_t)". Indeed the Jacobian 'A' in the continuous case [which is used and required in the code] and the Jacobian of the discrete case 'J' are not the Jacobians of the same function (g vs. f); yet the relationship between g and f is very simple, as w_{t+dt} = w_t + dt * dw/dt + O(dt^2), which implies f = Id + dt g + O(dt^2), and so is the relationship between A = g' and J = f' : on first approximation, J = exp(A dt) ~ Id + A dt. So one can easily retrieve A from J.


1E) Which loss to choose?
-------------------------

To solve the problem, and obtain good statistics in 1C, just considering the loss sum_{samples w} ||f_true(w) - f_predicted(w)|| will probably not be sufficient. Hence Slide 10, that tells that you can choose to add other terms to the loss, such as ones based on derivatives; or completely different losses! The choice of the loss will have to be motivated and explained in the report.

A good design of the neural network architecture can also help, to exploit simple reasonable assumptions that one might have about f.


2) Second approach : state partially known (first approach in slide 9)
==========================================

From the state vector w = (x,y,z), the dynamics of the Rössler attractor are fully described, by dw/dt = g(w). That is, no more information than just w is needed to compute dw/dt.
In the case of a partially-observed state, more specifically here in the case where one is given only y, one cannot deduce dw/dt or y_{t+dt} from y_t alone. Yet, it is possible to gather more information, by considering the history of y: y_t, y_{t-dt}, y_{t-2dt}... This information might even be sufficient to recover all the hidden information. For instance, when looking at the second equation of the dynamical system, dy/dt = x + ay, one can deduce from the knowledge of y_t and y_{t-dt} that x_t = dy/dt - a y_t ~ (y_t - y_{t-dt})/dt - a y_t. Similarly, it is possible to retrieve z_t (or an approximation thereof) from just the knowledge of a few past y_t'. This implies that a network with inputs (y_t, y_{t-dt}, y_{t-2dt}...) would have enough information to estimate the complete state w, and thus enough information to estimate y_{t+dt}. Another, related approach would be to be based on (y_t, dy_t/dt, d^2y_t/dt^2, ...). Further details about when/whether the full state can be recovered or not are given in Slides 5 & 6.

When observing only y, it is thus still possible to learn to predict y_{t+dt}, by considering the history of y. We are thus interested in finding a function f, represented by a neural network, such that:
f(y_t, y_{t-dt}, y_{t-2dt}...) = y_{t+dt}.
Possible frameworks for this include:
- recurrent networks (which keep a memory of the past), such as LSTM, of the form (prediction for y_{t+1}, h_{t+1}) = F(y_t, h_t) where 'h' denotes the internal state of the network, leading to chains of the form: prediction for y_{t+1} = F_1(y_t, F_2(y_{t-1}, F_2(y_{t-2}, F_2(y_{t-3}, ...)))).
- 1D temporal convolutions, or actually any function of a given time window: y_{t+dt} = f(y_t, y_{t-dt}, y_{t-2dt}, ..., y_{t - K dt}) for some fixed K.
- or a function of the K first derivatives if you choose to consider (y_t, dy_t/dt, d^2y_t/dt^2, ...) instead of the history directly.

The difficulty with a recurrent network is that if the loss (or some piece of code) makes use of the Jacobian (of the function y_t --> dy/dt, e.g.), things are a bit more complex here because of the depencies with respect to the past. The bottom remark on Slide 10 tells what to do in that case.


3) Third approach : state fully known, continuous system (third approach in slide 9 [bottom right])
========================================================

In section 1), the problem was formulated as finding f such that w_{t+dt} = f(w_t), for a given dt. Here instead we formulate the problem as finding g such that:
dw/dt = g(w_t).
This approach is closer to the original problem, that is, retrieving the dynamical system directly.

A first possible way to do this is to consider the supervised learning problem:
estimate the neural network g such that g(w) ~ dw/dt for all samples w.
Issue: how to know the speed dw/dt associated to a sample w?
Solution: you have access to the simulator, which gives w_{t+dt} given w_t. You can approximate dw/dt from this.
Issue 2: how to compute w_{t+dt} from w_t and g(w_t)? Different schemes are possible (first order approximation, or integration over time).

A second possible way is to use an ODE-net, whose code is available online (don't recode it!) and which should do the job for such a simple system.
The idea behing ODE-net is that in the methods above, the prediction step: "prediction of w_{t+dt}" = w_t + dt * g(w_t), is a very crude approximation: it is just a first-order approximation (in dt), which leads to small errors (in O(dt^2)), which will make our chaotic system diverge quite fast. Instead, one can use the power of ODE solvers, which are precisely meant for integrating speed fields over time: we provide the ODE solver with the function g, w_0, t, and the ODE solver returns w_0 + \int_0^t g(w_t) dt = w_0 + \int_0^t dw/dt dt = w_t, or more exactly an approximation thereof, but of much better quality that a crude w_0 + t g(w_0). Note the difference with the previous approaches: given w_0, one can now output w_t for any t, and not only w_{0+dt} for a fixed dt (fixed before training) as with the approach in section 1.
Issue: how to compute the derivative of the loss, i.e. Loss(w_t) = Loss(w_0 + \int_0^t g(w_t) dt ) = Loss( ODESolver(g, w_0, t) ) with respect to parameters of g, while ODESolver is a black box?
Answer: by integrating the gradient backwards in time (as for a standard backprop on a recurrent network, but with a continuous number of time steps), from gradient(time t) to gradient(time 0)... and this integration is done... with another run of the ODESolver!
The ODE-net library code asks you for g, w_0, t and returns grad_{params of g} Loss(w_t), which is what you need to train you model.
Ref for ODE-net: [Neural Ordinary Differential Equations; Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud; best paper at NeurIPS 2018] 

This method is more complex, and training is more difficult, as its loss is based on the error made when predicting w_t from w_0, and that the errors are expected to grow as exp(lambda t).
Yet, once successfully trained, it will be more precise (than the other methods). Another advantage of this approach is that the Jacobian 'A' is easy to compute.


Conclusion
==========

The different methods are compared in Slide 21. Each method has advantages and drawbacks.
You need to choose and implement only one of these approaches.

You will have to provide:
- a small report explaining the choices made (approach chosen, design of loss function, of the network...) + the analysis performed to validate the model (1C + 1D)
- the code to generate a trajectory with your trained model, following the template in time_series.py. Basically, when we call time_series.py, this should load the neural network (already trained and saved in a file that you also provide), generate with it a long trajectory (from t = 0 to t = 10 000 seconds), and save it. Do not forget that if your approach makes use of a time step dt, you should run 10 000 / dt steps to reach t = 10 000 seconds; in that case, please indicate in the code the dt chosen.
- the code to reproduce the pictures in the report.