Lesson 4: Information geometry

Intro:


I - Fisher information

Why information "geometry"?

// + differential entropy <--> Fisher information [Cover&Thomas p 672]
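A plausible reading of the link above is de Bruijn's identity (my assumption about which result on the cited pages is meant): perturbing \(X\) by a small independent Gaussian and differentiating the differential entropy yields the Fisher information,
\[
  \frac{\partial}{\partial t}\, h\big(X + \sqrt{t}\, Z\big) \;=\; \frac{1}{2}\, J\big(X + \sqrt{t}\, Z\big),
  \qquad Z \sim \mathcal{N}(0,1) \text{ independent of } X,
\]
where \(J(Y) = \mathbb{E}\big[(\partial_y \log f_Y(Y))^2\big]\) is the Fisher information of the density with respect to a location parameter.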


II - Natural gradient [Bensadon MDL talk 6]
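As a reminder of the update this section presumably covers (standard definition, not taken from the talk itself): the natural gradient preconditions the ordinary gradient by the inverse Fisher information matrix, which makes the step invariant under smooth reparametrizations of \(\theta\):
\[
  \theta_{t+1} \;=\; \theta_t \;-\; \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t),
  \qquad
  F(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\big[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top}\big].
\]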


III - Universal coding

[Bensadon p 30]

1. Explicit encoding of parameters

2. Parameter update, no encoding

In 2:
- no need to encode the parameters!
- gain: no encoded/hard-coded parameter
- cost: the data is encoded with wrong (not yet converged) parameters at the beginning, so its codelength is higher until the parameters converge (see the sketch below)
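A minimal sketch of this prequential scheme for a binary source, assuming a Laplace rule-of-succession plug-in estimator (an illustrative choice, not prescribed by the notes); the sequence is the one from part 4 below. Each symbol is encoded with the distribution estimated from the past symbols only, so no parameter is ever transmitted:

    from math import log2

    def online_codelengths(bits):
        """Per-symbol codelengths (bits) when each symbol is encoded with a
        Laplace plug-in estimate fitted on the past symbols only."""
        n0 = n1 = 0
        lengths = []
        for b in bits:
            p1 = (n1 + 1) / (n0 + n1 + 2)        # current estimate of P(next bit = 1)
            lengths.append(-log2(p1 if b == 1 else 1 - p1))
            n1 += b
            n0 += 1 - b
        return lengths

    seq = [int(c) for c in "001111000010100100"]
    lens = online_codelengths(seq)
    # The total codelength exceeds the ideal length under the final estimate:
    # the excess is the price paid while the estimator converges.
    print(sum(lens), lens[:3], lens[-3:])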

3. Normalized Maximum Likelihood [Bensadon p 34]
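As a reminder (standard definition; presumably what the cited pages present): NML assigns to each sequence its maximum-likelihood probability, renormalized over all sequences of the same length; the log of the normalizer is the worst-case regret ("parametric complexity") that every sequence pays.
\[
  p_{\mathrm{NML}}(x_{1:n}) \;=\; \frac{p_{\hat\theta(x_{1:n})}(x_{1:n})}{\sum_{y_{1:n}} p_{\hat\theta(y_{1:n})}(y_{1:n})},
  \qquad
  \hat\theta(x_{1:n}) = \arg\max_\theta p_\theta(x_{1:n}).
\]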

4. Choose a prior \(q\) over parameters

[Cover&Thomas p 433-434]

Example with Bernoulli(\(\theta\)): binary sequence 001111000010100100
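A minimal sketch for this example, assuming a Beta prior on \(\theta\) (the specific priors below are illustrative choices). For a sequence with \(k\) ones out of \(n\), the mixture codelength is \(-\log_2 \int_0^1 \theta^{k}(1-\theta)^{n-k}\, q(\theta)\, d\theta\); for a Beta prior it can be accumulated symbol by symbol through the posterior predictive probabilities:

    from math import log2

    def mixture_codelength(bits, a=1.0, b=1.0):
        """Codelength (bits) of the Bayes mixture code with a Beta(a, b) prior
        on theta, accumulated via the sequential posterior predictive."""
        n0 = n1 = 0
        total = 0.0
        for bit in bits:
            p1 = (n1 + a) / (n0 + n1 + a + b)    # posterior predictive P(next bit = 1)
            total += -log2(p1 if bit == 1 else 1 - p1)
            n1 += bit
            n0 += 1 - bit
        return total

    seq = [int(c) for c in "001111000010100100"]
    print(mixture_codelength(seq))               # uniform prior, Beta(1, 1)
    print(mixture_codelength(seq, 0.5, 0.5))     # Beta(1/2, 1/2) (KT estimator)

Note that with the uniform prior this coincides exactly with the plug-in sketch of part 2: for the Bernoulli family, the Laplace rule of succession is the Bayes predictive under a uniform prior.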


IV - Parameter precision

[Bensadon p 30]
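Presumably the point of this section is the classic precision trade-off (standard MDL result, stated here from memory rather than from the cited page): with \(n\) observations, encoding a parameter to precision finer than \(\sim 1/\sqrt{n}\) buys essentially nothing, because the log-likelihood is flat at that scale; the optimal two-part code therefore spends about
\[
  \frac{k}{2} \log_2 n + O(1) \text{ bits}
\]
on the \(k\) parameters, on top of the codelength of the data given the (truncated) parameters.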


V - Prior by default
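If "prior by default" refers to the Jeffreys prior (my assumption; it is the usual default in this MDL / information-geometry setting), it is the prior proportional to the square root of the determinant of the Fisher information, and it is invariant under reparametrization:
\[
  q_{\mathrm{Jeffreys}}(\theta) \;\propto\; \sqrt{\det F(\theta)}.
\]
For Bernoulli(\(\theta\)), \(F(\theta) = 1/(\theta(1-\theta))\), which gives the Beta(1/2, 1/2) prior, i.e. the KT estimator used in the sketch of III.4.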


VI - Examples / Miscellaneous

// + CTW (Context Tree Weighting) [Bensadon p 37]
// + ex: music score generation with an RNN?

Note:
- maximizing entropy can be good:
  \(\mathrm{KL}(p_\theta \,\|\, \mathrm{uniform}) = \sum_x p_\theta(x) \log\big(p_\theta(x)\, |X|\big) = \log|X| - H(p_\theta)\)
  \(\implies\) maximizing the entropy minimizes the KL divergence to the uniform law: if you have no information on the law, pick the parameter leading to the highest entropy (see the check below)
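A quick numerical check of this identity on the Bernoulli family (a sketch; the grid of \(\theta\) values is arbitrary):

    from math import log2

    def entropy(p):
        return -sum(pi * log2(pi) for pi in p if pi > 0)

    def kl_to_uniform(p):
        """KL(p || uniform) in bits, for p over a finite alphabet of size len(p)."""
        n = len(p)
        return sum(pi * log2(pi * n) for pi in p if pi > 0)

    for theta in (0.1, 0.5, 0.9):
        p = [theta, 1 - theta]
        # identity: KL(p || uniform) = log2|X| - H(p); minimal at theta = 0.5
        print(theta, round(kl_to_uniform(p), 4), round(log2(len(p)) - entropy(p), 4))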


VII - Conclusion of the Information Theory part