------------------------------------------------------------------------------------------
Lesson 7 : Information geometry
------------------------------------------------------------------------------------------

Intro:
- last session of the Information Theory part: Fisher information
- fundamental applications to many domains:
  . optimization
  . modeling: priors on parameters
  . MDL: precision / encoding cost of a parameter?

------------------------------------------------------------------------------------------
I - Fisher information
----------------------

+ Fisher information / metric / matrix
  . setting: Model(theta) --> p_theta(x)
  . formula 1 : J(theta) = E_{x ~ p_theta} [ d^2( - log p_theta(x) ) / dtheta^2 ]
    . average second derivative (Hessian) of the encoding cost, over all possible data under the model law
  . formula 2 : J(theta) = E_{x ~ p_theta} [ (d log p_theta(x)/dtheta) (d log p_theta(x)/dtheta)^T ]
    . covariance of the gradient of the cost (the score), which has mean 0 under the model law
  . equivalence obtained by developing:
    . dlogp/dt = (1/p) dp/dt
    . d2logp/dt2 = -(1/p^2) (dp/dt)(dp/dt)^T + (1/p) d2p/dt2
    . J(theta) = int_x p (-d2logp/dt2) dx = E[ (dlogp/dt)(dlogp/dt)^T ] - int_x d2p/dt2 dx
      and int_x d2p/dt2 dx = d2/dt2 int_x p dx = d2/dt2 (1) = 0

Why information "geometry" ?
  . gives norms to variations in the space of probability distributions
  . examples of observations: x1, x2, ... or (x1,y1), (x2,y2), ... to learn a function y = f(x)
    ==> finding the right theta = learning the function

+ second-order expansion of KL(p_{theta+dtheta} || p_theta) = int_x p_{theta+dtheta}(x) log[ p_{theta+dtheta}(x) / p_theta(x) ] dx
  . use ln(1+z) = z - (1/2) z^2 + O(z^3)
  . use d ln f(z) / dz = (1/f(z)) df/dz
  . use int_x p_theta(x) dx = 1 (for any theta) ==> int_x d^k p_theta(x)/dtheta^k dx = 0
  . from now on, p means p_theta(x), del means dtheta (the variation), t means theta
  . p_{theta+dtheta}(x) = p + (dp/dt) del + (1/2) del^T (d2p/dt2) del + O(del^3)
  . p_{theta+dtheta}(x) / p_theta(x) = 1 + (1/p)(dp/dt) del + (1/2) del^T (1/p)(d2p/dt2) del + O(del^3)
  . log[ p_{theta+dtheta}(x) / p_theta(x) ] = (1/p)(dp/dt) del + (1/2) del^T (1/p)(d2p/dt2) del
                                              - (1/2) del^T (1/p^2)(dp/dt)(dp/dt)^T del + O(del^3)
  . KL(p_{theta+del} || p_theta) = E_{x ~ p_{theta+del}}[ log ratio ]
      = int_x ( p + (dp/dt) del + (1/2) del^T (d2p/dt2) del + ... ) * (log ratio) dx
      = [ 0 + 0 - (1/2) del^T J del ]    <- terms weighted by p
      + [ del^T J del ]                  <- term weighted by (dp/dt) del (the remaining ones are O(del^3))
      + O(del^3)
      = (1/2) del^T J(theta) del + O(del^3),
        with J(theta) = E_{x~p}[ (dlogp/dt)(dlogp/dt)^T ] = int_x (1/p)(dp/dt)(dp/dt)^T dx
  --> J(theta) is the metric associated to KL
  --> curvature of relative entropy

// + differential entropy <--> Fisher information [Cover & Thomas p 672]

+ in practice:
  . Fisher: E_{x ~ p_theta}[ ... ] ~ (1/n) sum_i [ ... at the observed x_i ]
    ==> empirical covariance of the score, or average Hessian
  . the sum makes sense because all terms live in the same space (tangent space of the parameter space at theta)

+ link with the Cramér-Rao bound, which is reached (theorem, Amari)
  [Cover & Thomas: chapter 11.10, p 418 (392): estimator bias, Cramér-Rao]
  . estimator: given observations x_1, ..., x_n, guess the parameter theta so that p_theta fits them best
  . an estimator theta_estimated is unbiased if E_{X ~ p_theta}[ theta_estimated(X) ] = theta
  . question: the mean of the estimator is right, but what about its variance? is |theta_estimated - theta| typically big?
  . Theorem: Cramér-Rao inequality: variance(any unbiased estimator) >= 1/J(theta)   in dim 1
                                    covariance(estimator) >= J(theta)^-1             in higher dims
  . Proof in dim 1
    . unbiased estimator T
    . p = p_theta
    . V = dlogp/dtheta (the score); E_{x~p}[V] = int_x dp/dtheta dx = 0
    . Cauchy-Schwarz: E_{x~p}[ V (T - theta) ]^2 <= E[ V^2 ] E[ (T - theta)^2 ] = J(theta) var(T)
    . E[ V (T - theta) ] = E[ V T ] (since E[V] = 0), and E[VT] = 1:
      . E[VT] = int_x p (dlogp/dtheta) T = int_x (dp/dtheta) T = d/dtheta int_x p T = d/dtheta theta = 1
    ==> 1 <= J(theta) var(T), i.e. var(T) >= 1/J(theta)
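
+ illustration (not from the lecture): a minimal Python sketch checking the equivalent computations of
  J(theta) on a Bernoulli(theta) model, where J(theta) = 1/(theta(1-theta)); the values of theta, delta
  and the sample size below are arbitrary choices.

# Minimal numerical sketch: Fisher information of a Bernoulli(theta) model,
# computed in the equivalent ways listed in Section I.
import numpy as np

theta = 0.3
J_analytic = 1.0 / (theta * (1.0 - theta))            # known closed form

# Formula 1: average second derivative of the encoding cost -log p_theta(x)
def d2_neglogp(x, t):                                  # d^2(-log p)/dtheta^2
    return x / t**2 + (1.0 - x) / (1.0 - t)**2

# Formula 2: second moment of the score dlog p_theta(x)/dtheta (its mean is 0)
def score(x, t):
    return x / t - (1.0 - x) / (1.0 - t)

xs = np.array([0.0, 1.0])
px = np.array([1.0 - theta, theta])                    # model law p_theta

J_formula1 = np.sum(px * d2_neglogp(xs, theta))
J_formula2 = np.sum(px * score(xs, theta) ** 2)

# KL curvature: KL(p_{theta+del} || p_theta) ~ (1/2) del^2 J(theta)
def kl_bern(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

delta = 1e-3
J_from_kl = 2.0 * kl_bern(theta + delta, theta) / delta**2

# "In practice": average over samples instead of the exact E_{x ~ p_theta}
rng = np.random.default_rng(0)
sample = rng.binomial(1, theta, size=100_000)
J_empirical = np.mean(score(sample, theta) ** 2)

print(J_analytic, J_formula1, J_formula2, J_from_kl, J_empirical)
# all values are close to 1/(theta(1-theta)) ~ 4.76
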
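+ illustration (not from the lecture): a small Monte-Carlo sketch of the Cramér-Rao bound for
  Bernoulli(theta): the empirical frequency of 1s is an unbiased estimator whose variance
  theta(1-theta)/n equals 1/(n J(theta)), so the bound is attained here; theta, n and the number
  of trials are arbitrary choices.

# Cramer-Rao bound for Bernoulli(theta) with n i.i.d. observations:
# the empirical mean T = (1/n) sum_i x_i is unbiased and its variance
# attains the bound 1/(n J(theta)) exactly in this model.
import numpy as np

theta, n, trials = 0.3, 50, 100_000
rng = np.random.default_rng(1)

# 'trials' independent datasets of size n; estimator = empirical frequency of 1s
X = rng.binomial(1, theta, size=(trials, n))
T = X.mean(axis=1)

var_T = T.var()                                        # Monte-Carlo variance of the estimator
J_n = n / (theta * (1.0 - theta))                      # Fisher info of n i.i.d. samples = n * J(theta)

print("E[T]           ", T.mean(), " (theta =", theta, ")")   # unbiased
print("var(T)         ", var_T)
print("Cramer-Rao 1/J ", 1.0 / J_n)                    # = theta(1-theta)/n, same value
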
------------------------------------------------------------------------------------------
II - Natural gradient [Bensadon MDL talk 6]
---------------------

+ desired properties
  . invariance to reparameterization
+ importance of metrics
  . what is the norm of a variation dtheta?
  . the naive (Euclidean) norm depends on the parameter representation
+ natural gradient: steepest descent w.r.t. the Fisher metric: theta <- theta - eta J(theta)^-1 grad(loss)
+ approximations and related methods
  . Hessian^-1
  . diagonal approximation
  . Kalman
  . EM (expectation-maximization) = natural gradient step
+ link with Newton's method
  . exponential family: in the natural parameterization, the Fisher matrix equals the Hessian of -log p_theta,
    so natural gradient = Newton

------------------------------------------------------------------------------------------
III - Universal coding
----------------------
[Bensadon p 30]
- model with parameters
- data arrives
--> 4 ways to update the model / encode the parameters:

1 Explicit encoding of the parameters
  . read all the data, estimate the parameters
  . encode the parameters
  . re-read all the data and encode it with the model using those parameters
  --> problem: how to encode a real value? an infinite number of bits?
    . keep the k first binary digits of theta: K(theta) = k, precision on theta = 2^-k
    . K(mu) - log mu(x) : complexity vs accuracy trade-off (two-part code length)
  ==> the optimal choice of k is treated in section IV (using Fisher information)

2 Parameter updates, no encoding
  . start from canonical parameters
  . encode the next chunk of data
  . update the parameters accordingly
  . iterate
  In 2:
  . no need to encode the parameters!
  . gain: no encoded / hard-coded parameter
  . cost: the data is encoded with wrong parameters at the beginning, so its code length is higher
    until the parameters converge

3 Normalized Maximum Likelihood (NML) [Bensadon p 34]
  . issue with 1:
    . given x, pick theta_1 and encode x with p_theta1 (encoding theta_1 first),
      or pick theta_2 and encode x with p_theta2 (encoding theta_2 first)
    ==> these 2 codes are different but encode the same x ==> redundancy ==> not an optimal code
  . solution: for a given x, set p_theta(x) = 0 for all theta with p_theta(x) < p_{best theta}(x);
    denote theta(x) = best theta for x = argmax_theta p_theta(x)
  . and renormalize: NML(x) = p_{theta(x)}(x) / sum_z p_{theta(z)}(z)
  . issues: many, for instance NML(x1,x2) =/= NML(x1) NML(x2|x1)

4 Choose a prior q over the parameters
  . and integrate over it: p_model(x) = \int_theta q(theta) p_theta(x) dtheta
  . now independent of theta
  . as if replacing the discrete mixture sum_mu 2^-K(mu) mu(x) by \int_theta q(theta) p_theta(x) dtheta
  . encode x with p_model
  . Note: explicitly choosing one parameter is less efficient: [Bensadon p 34]
    . max( q(theta_1) p_theta1(x), q(theta_2) p_theta2(x) ) vs sum_theta q(theta) p_theta(x)
    ==> the max is smaller than the sum, hence longer codewords!

Cover & Thomas p 433-434
Example with Bernoulli(theta): binary sequence 001111000010100100
  . code with theta known: ~ n H(theta) bits on average (the entropy bound)
  . without theta: encode the number of 1s, then the index of the sequence among all sequences with that number of 1s
    ==> ~ log(n+1) + n H(#1s / n) bits: close to the theta-known cost,
        and shorter than the n raw bits when the fraction of 1s is far from 1/2
  . uniform prior on theta ==> new proba(x1...xn) = \int_theta p_theta(x1...xn) dtheta ==> similar code length
  . same mixture computed on the fly = Laplace estimate of theta:
    p(x_{i+1} = 1 | x_{<=i}) = (number of 1s so far + 1)/(i + 2)
  . with another prior on theta, Dirichlet(1/2,1/2) (= Beta(1/2,1/2)): q(theta) = 1 / ( pi sqrt( theta (1-theta) ) )
    ==> as if p(x_{i+1} = 1 | x_{<=i}) = (number of 1s so far + 1/2)/(i + 1)
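
+ illustration (not from the lecture): a Python sketch comparing code lengths, in bits, for one binary
  sequence under the strategies above: theta known, two-part code with a quantized theta, and the
  Laplace / Krichevsky-Trofimov predictive codes. The sequence length, theta and the precision grid are
  arbitrary choices, and the two-part code below counts only the k bits of the quantized parameter.

# Code lengths (in bits) for one Bernoulli sequence under several universal codes.
import numpy as np

rng = np.random.default_rng(2)
theta_true = 0.2
n = 1000
x = rng.binomial(1, theta_true, size=n)
ones = int(x.sum())

def bits(p):                        # ideal code length -log2(p)
    return -np.log2(p)

# (a) theta known: ideal code -log2 p_theta(x)
L_known = ones * bits(theta_true) + (n - ones) * bits(1 - theta_true)

# (b) two-part code: k binary digits of theta_hat, then the data coded with the quantized value
theta_hat = ones / n
def two_part(k):
    q = max(min(round(theta_hat * 2**k) / 2**k, 1 - 2**-k), 2**-k)   # quantized, kept inside (0,1)
    return k + ones * bits(q) + (n - ones) * bits(1 - q)
L_two_part = {k: two_part(k) for k in range(1, 16)}
best_k = min(L_two_part, key=L_two_part.get)          # optimal precision ~ (1/2) log2 n digits

# (c) Laplace predictive code: p(next = 1 | past) = (#1s so far + 1)   / (i + 2)
# (d) KT predictive code:      p(next = 1 | past) = (#1s so far + 1/2) / (i + 1)
def predictive(alpha):
    L, c = 0.0, 0
    for i, xi in enumerate(x):
        p1 = (c + alpha) / (i + 2 * alpha)
        L += bits(p1 if xi == 1 else 1 - p1)
        c += xi
    return L

print("theta known            ", L_known)
print("two-part, best k =", best_k, ":", L_two_part[best_k])
print("Laplace (uniform prior)", predictive(1.0))
print("KT (Dirichlet 1/2,1/2) ", predictive(0.5))
# the last three typically exceed the theta-known cost by a few bits,
# on the order of (1/2) log2 n ~ 5 bits here (cf sections IV and VI)
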
------------------------------------------------------------------------------------------
IV - Parameter precision
-------------------------
[Bensadon p 30]
+ precision when encoding / estimating parameters
+ cf Cramér-Rao: the estimator fluctuates by about 1/sqrt(n J(theta)), so encoding theta more precisely
  than that is wasted bits
  ==> optimal precision ~ 1/sqrt(n), i.e. about (1/2) log n bits per parameter

------------------------------------------------------------------------------------------
V - Prior by default
--------------------
+ Jeffreys prior [Bensadon p 34]: q(theta) proportional to sqrt( det J(theta) ), invariant under reparameterization
+ Example: Krichevsky-Trofimov estimator
  . Bernoulli(theta): Jeffreys prior = Dirichlet(1/2,1/2)
  . the mixture coded on the fly: p(x_{i+1} = 1 | x_{<=i}) = (number of 1s so far + 1/2)/(i + 1)
    (cf the example of section III)

------------------------------------------------------------------------------------------
VI - Examples / Miscellaneous
-----------------------------
+ justification of BIC? [Bensadon p 32]
  . BIC : K(mu) = (1/2) * (number of parameters) * log(number of observations)
    Bayesian Information Criterion (BIC) [Schwarz, 1978]
  . these two approaches lead to the same encoding cost:
    . two-part code (encode the parameters, then the data):
      cost of encoding the parameters = (bits at the optimal precision, ~ (1/2) log n) * (number of parameters)
    . regret of the mixture using Jeffreys prior: also ~ (number of parameters / 2) * log n
// + CTW [Bensadon p 37]
// + ex: music score generation with an RNN?

Note:
- maximizing entropy can be good:
  . KL(p_theta || uniform) = sum_x p_theta(x) log( p_theta(x) |X| ) = log|X| - H(p_theta)
  ==> if you have no information on the law, pick the parameter leading to the highest entropy

------------------------------------------------------------------------------------------
VII - Conclusion of the Information Theory part
-----------------------------------------------
- Information geometry
------------------------------------------------------------------------------------------