------------------------------------------------------------------------------------------
Lesson 7 : Information geometry
------------------------------------------------------------------------------------------

Intro:
- last session of the Information Theory part: Fisher information
- fundamental applications to many domains:
  . optimization
  . modeling: priors on parameters
  . MDL: precision / encoding cost of a parameter?

------------------------------------------------------------------------------------------
I - Fisher information
----------------------

+ Fisher information / metric / matrix
  . setting: Model(theta) --> p_theta(x)
  . formula 1 : J(theta) = E_{x ~ p_theta} [ d^2( - log p_theta(x) ) / dtheta^2 ]
    . average second derivative (Hessian) of the encoding cost, over all possible data under the model law
  . formula 2 : J(theta) = E_{x ~ p_theta} [ (d log p_theta(x)/dtheta) (d log p_theta(x)/dtheta)^T ]
    . covariance of the gradient of the cost (the score), which has mean 0 under the model law
  . equivalence obtained by developing:
    . dlogp/dt = (1/p) dp/dt
    . d2logp/dt2 = -(1/p^2) (dp/dt)(dp/dt)^T + (1/p) d2p/dt2
    . J(theta) = int_x p (-d2logp/dt2) dx = E[ (dlogp/dt)(dlogp/dt)^T ] - int_x d2p/dt2 dx
      and int_x d2p/dt2 dx = d2/dt2 int_x p dx = d2/dt2 (1) = 0

Why information "geometry" ?
  . gives norms to variations in the space of probability distributions
  . examples of observations: x1, x2, ... or (x1,y1), (x2,y2), ... to learn a function y = f(x)
    ==> finding the right theta = learning the function

+ second-order expansion of KL(p_{theta+dtheta} || p_theta) = int_x p_{theta+dtheta}(x) log[ p_{theta+dtheta}(x) / p_theta(x) ] dx
  . use ln(1+z) = z - (1/2) z^2 + O(z^3)
  . use d ln f(z) / dz = (1/f(z)) df/dz
  . use int_x p_theta(x) dx = 1 (for any theta) ==> int_x d^k p_theta(x)/dtheta^k dx = 0
  . from now on, p means p_theta(x), del means dtheta (the variation), t means theta
  . p_{theta+dtheta}(x) = p + (dp/dt) del + (1/2) del^T (d2p/dt2) del + O(del^3)
  . p_{theta+dtheta}(x) / p_theta(x) = 1 + (1/p)(dp/dt) del + (1/2) del^T (1/p)(d2p/dt2) del + O(del^3)
  . log[ p_{theta+dtheta}(x) / p_theta(x) ] = (1/p)(dp/dt) del + (1/2) del^T (1/p)(d2p/dt2) del
                                              - (1/2) del^T (1/p^2)(dp/dt)(dp/dt)^T del + O(del^3)
  . KL(p_{theta+del} || p_theta) = E_{x ~ p_{theta+del}}[ log ratio ]
      = int_x ( p + (dp/dt) del + (1/2) del^T (d2p/dt2) del + ... ) * (log ratio) dx
      = [ 0 + 0 - (1/2) del^T J del ]    <- terms weighted by p
      + [ del^T J del ]                  <- term weighted by (dp/dt) del (the remaining ones are O(del^3))
      + O(del^3)
      = (1/2) del^T J(theta) del + O(del^3),
        with J(theta) = E_{x~p}[ (dlogp/dt)(dlogp/dt)^T ] = int_x (1/p)(dp/dt)(dp/dt)^T dx
  --> J(theta) is the metric associated to KL
  --> curvature of relative entropy

// + differential entropy <--> Fisher information [Cover & Thomas p 672]

+ in practice:
  . Fisher: E_{x ~ p_theta}[ ... ] ~ (1/n) sum_i [ ... at the observed x_i ]
    ==> empirical covariance of the score, or average Hessian
  . the sum makes sense because all terms live in the same space (tangent space of the parameter space at theta)

+ link with the Cramér-Rao bound, which is reached (theorem, Amari)
  [Cover & Thomas: chapter 11.10, p 418 (392): estimator bias, Cramér-Rao]
  . estimator: given observations x_1, ..., x_n, guess the parameter theta so that p_theta fits them best
  . an estimator theta_estimated is unbiased if E_{X ~ p_theta}[ theta_estimated(X) ] = theta
  . question: the mean of the estimator is right, but what about its variance? is |theta_estimated - theta| typically big?
  . Theorem: Cramér-Rao inequality: variance(any unbiased estimator) >= 1/J(theta)   in dim 1
                                    covariance(estimator) >= J(theta)^-1             in higher dims
  . Proof in dim 1
    . unbiased estimator T
    . p = p_theta
    . V = dlogp/dtheta (the score); E_{x~p}[V] = int_x dp/dtheta dx = 0
    . Cauchy-Schwarz: E_{x~p}[ V (T - theta) ]^2 <= E[ V^2 ] E[ (T - theta)^2 ] = J(theta) var(T)
    . E[ V (T - theta) ] = E[ V T ] (since E[V] = 0), and E[VT] = 1:
      . E[VT] = int_x p (dlogp/dtheta) T = int_x (dp/dtheta) T = d/dtheta int_x p T = d/dtheta theta = 1
    ==> 1 <= J(theta) var(T), i.e. var(T) >= 1/J(theta)
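
+ illustration (not from the lecture): a minimal Python sketch checking the equivalent computations of
  J(theta) on a Bernoulli(theta) model, where J(theta) = 1/(theta(1-theta)); the values of theta, delta
  and the sample size below are arbitrary choices.

# Minimal numerical sketch: Fisher information of a Bernoulli(theta) model,
# computed in the equivalent ways listed in Section I.
import numpy as np

theta = 0.3
J_analytic = 1.0 / (theta * (1.0 - theta))            # known closed form

# Formula 1: average second derivative of the encoding cost -log p_theta(x)
def d2_neglogp(x, t):                                  # d^2(-log p)/dtheta^2
    return x / t**2 + (1.0 - x) / (1.0 - t)**2

# Formula 2: second moment of the score dlog p_theta(x)/dtheta (its mean is 0)
def score(x, t):
    return x / t - (1.0 - x) / (1.0 - t)

xs = np.array([0.0, 1.0])
px = np.array([1.0 - theta, theta])                    # model law p_theta

J_formula1 = np.sum(px * d2_neglogp(xs, theta))
J_formula2 = np.sum(px * score(xs, theta) ** 2)

# KL curvature: KL(p_{theta+del} || p_theta) ~ (1/2) del^2 J(theta)
def kl_bern(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

delta = 1e-3
J_from_kl = 2.0 * kl_bern(theta + delta, theta) / delta**2

# "In practice": average over samples instead of the exact E_{x ~ p_theta}
rng = np.random.default_rng(0)
sample = rng.binomial(1, theta, size=100_000)
J_empirical = np.mean(score(sample, theta) ** 2)

print(J_analytic, J_formula1, J_formula2, J_from_kl, J_empirical)
# all values are close to 1/(theta(1-theta)) ~ 4.76
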
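+ illustration (not from the lecture): a small Monte-Carlo sketch of the Cramér-Rao bound for
  Bernoulli(theta): the empirical frequency of 1s is an unbiased estimator whose variance
  theta(1-theta)/n equals 1/(n J(theta)), so the bound is attained here; theta, n and the number
  of trials are arbitrary choices.

# Cramer-Rao bound for Bernoulli(theta) with n i.i.d. observations:
# the empirical mean T = (1/n) sum_i x_i is unbiased and its variance
# attains the bound 1/(n J(theta)) exactly in this model.
import numpy as np

theta, n, trials = 0.3, 50, 100_000
rng = np.random.default_rng(1)

# 'trials' independent datasets of size n; estimator = empirical frequency of 1s
X = rng.binomial(1, theta, size=(trials, n))
T = X.mean(axis=1)

var_T = T.var()                                        # Monte-Carlo variance of the estimator
J_n = n / (theta * (1.0 - theta))                      # Fisher info of n i.i.d. samples = n * J(theta)

print("E[T]           ", T.mean(), " (theta =", theta, ")")   # unbiased
print("var(T)         ", var_T)
print("Cramer-Rao 1/J ", 1.0 / J_n)                    # = theta(1-theta)/n, same value
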
------------------------------------------------------------------------------------------
II - Natural gradient [Bensadon MDL talk 6]
---------------------

+ desired properties
  . invariance to reparameterization
+ importance of metrics
  . what is the norm of a variation dtheta?
  . the naive (Euclidean) norm depends on the parameter representation
+ natural gradient: steepest descent w.r.t. the Fisher metric: theta <- theta - eta J(theta)^-1 grad(loss)
+ approximations and related methods
  . Hessian^-1
  . diagonal approximation
  . Kalman
  . EM (expectation-maximization) = natural gradient step
+ link with Newton's method
  . exponential family: in the natural parameterization, the Fisher matrix equals the Hessian of -log p_theta,
    so natural gradient = Newton

------------------------------------------------------------------------------------------
III - Universal coding
----------------------
[Bensadon p 30]
- model with parameters
- data arrives
--> 4 ways to update the model / encode the parameters:

1 Explicit encoding of the parameters
  . read all the data, estimate the parameters
  . encode the parameters
  . re-read all the data and encode it with the model using those parameters
  --> problem: how to encode a real value? an infinite number of bits?
    . keep the k first binary digits of theta: K(theta) = k, precision on theta = 2^-k
    . K(mu) - log mu(x) : complexity vs accuracy trade-off (two-part code length)
  ==> the optimal choice of k is treated in section IV (using Fisher information)

2 Parameter updates, no encoding
  . start from canonical parameters
  . encode the next chunk of data
  . update the parameters accordingly
  . iterate
  In 2:
  . no need to encode the parameters!
  . gain: no encoded / hard-coded parameter
  . cost: the data is encoded with wrong parameters at the beginning, so its code length is higher
    until the parameters converge

3 Normalized Maximum Likelihood (NML) [Bensadon p 34]
  . issue with 1:
    . given x, pick theta_1 and encode x with p_theta1 (encoding theta_1 first),
      or pick theta_2 and encode x with p_theta2 (encoding theta_2 first)
    ==> these 2 codes are different but encode the same x ==> redundancy ==> not an optimal code
  . solution: for a given x, set p_theta(x) = 0 for all theta with p_theta(x) < p_{best theta}(x);
    denote theta(x) = best theta for x = argmax_theta p_theta(x)
  . and renormalize: NML(x) = p_{theta(x)}(x) / sum_z p_{theta(z)}(z)
  . issues: many, for instance NML(x1,x2) =/= NML(x1) NML(x2|x1)

4 Choose a prior q over the parameters
  . and integrate over it: p_model(x) = \int_theta q(theta) p_theta(x) dtheta
  . now independent of theta
  . as if replacing the discrete mixture sum_mu 2^-K(mu) mu(x) by \int_theta q(theta) p_theta(x) dtheta
  . encode x with p_model
  . Note: explicitly choosing one parameter is less efficient: [Bensadon p 34]
    . max( q(theta_1) p_theta1(x), q(theta_2) p_theta2(x) ) vs sum_theta q(theta) p_theta(x)
    ==> the max is smaller than the sum, hence longer codewords!

Cover & Thomas p 433-434
Example with Bernoulli(theta): binary sequence 001111000010100100
  . code with theta known: ~ n H(theta) bits on average (the entropy bound)
  . without theta: encode the number of 1s, then the index of the sequence among all sequences with that number of 1s
    ==> ~ log(n+1) + n H(#1s / n) bits: close to the theta-known cost,
        and shorter than the n raw bits when the fraction of 1s is far from 1/2
  . uniform prior on theta ==> new proba(x1...xn) = \int_theta p_theta(x1...xn) dtheta ==> similar code length
  . same mixture computed on the fly = Laplace estimate of theta:
    p(x_{i+1} = 1 | x_{<=i}) = (number of 1s so far + 1)/(i + 2)
  . with another prior on theta, Dirichlet(1/2,1/2) (= Beta(1/2,1/2)): q(theta) = 1 / ( pi sqrt( theta (1-theta) ) )
    ==> as if p(x_{i+1} = 1 | x_{<=i}) = (number of 1s so far + 1/2)/(i + 1)
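
+ illustration (not from the lecture): a Python sketch comparing code lengths, in bits, for one binary
  sequence under the strategies above: theta known, two-part code with a quantized theta, and the
  Laplace / Krichevsky-Trofimov predictive codes. The sequence length, theta and the precision grid are
  arbitrary choices, and the two-part code below counts only the k bits of the quantized parameter.

# Code lengths (in bits) for one Bernoulli sequence under several universal codes.
import numpy as np

rng = np.random.default_rng(2)
theta_true = 0.2
n = 1000
x = rng.binomial(1, theta_true, size=n)
ones = int(x.sum())

def bits(p):                        # ideal code length -log2(p)
    return -np.log2(p)

# (a) theta known: ideal code -log2 p_theta(x)
L_known = ones * bits(theta_true) + (n - ones) * bits(1 - theta_true)

# (b) two-part code: k binary digits of theta_hat, then the data coded with the quantized value
theta_hat = ones / n
def two_part(k):
    q = max(min(round(theta_hat * 2**k) / 2**k, 1 - 2**-k), 2**-k)   # quantized, kept inside (0,1)
    return k + ones * bits(q) + (n - ones) * bits(1 - q)
L_two_part = {k: two_part(k) for k in range(1, 16)}
best_k = min(L_two_part, key=L_two_part.get)          # optimal precision ~ (1/2) log2 n digits

# (c) Laplace predictive code: p(next = 1 | past) = (#1s so far + 1)   / (i + 2)
# (d) KT predictive code:      p(next = 1 | past) = (#1s so far + 1/2) / (i + 1)
def predictive(alpha):
    L, c = 0.0, 0
    for i, xi in enumerate(x):
        p1 = (c + alpha) / (i + 2 * alpha)
        L += bits(p1 if xi == 1 else 1 - p1)
        c += xi
    return L

print("theta known            ", L_known)
print("two-part, best k =", best_k, ":", L_two_part[best_k])
print("Laplace (uniform prior)", predictive(1.0))
print("KT (Dirichlet 1/2,1/2) ", predictive(0.5))
# the last three typically exceed the theta-known cost by a few bits,
# on the order of (1/2) log2 n ~ 5 bits here (cf sections IV and VI)
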
------------------------------------------------------------------------------------------
IV - Parameter precision
-------------------------
[Bensadon p 30]
+ precision when encoding / estimating parameters
+ cf Cramér-Rao: the estimator fluctuates by about 1/sqrt(n J(theta)), so encoding theta more precisely
  than that is wasted bits
  ==> optimal precision ~ 1/sqrt(n), i.e. about (1/2) log n bits per parameter

------------------------------------------------------------------------------------------
V - Prior by default
--------------------
+ Jeffreys prior [Bensadon p 34]: q(theta) proportional to sqrt( det J(theta) ), invariant under reparameterization
+ Example: Krichevsky-Trofimov estimator
  . Bernoulli(theta): Jeffreys prior = Dirichlet(1/2,1/2)
  . the mixture coded on the fly: p(x_{i+1} = 1 | x_{<=i}) = (number of 1s so far + 1/2)/(i + 1)
    (cf the example of section III)

------------------------------------------------------------------------------------------
VI - Examples / Miscellaneous
-----------------------------
+ justification of BIC? [Bensadon p 32]
  . BIC : K(mu) = (1/2) * (number of parameters) * log(number of observations)
    Bayesian Information Criterion (BIC) [Schwarz, 1978]
  . these two approaches lead to the same encoding cost:
    . two-part code (encode the parameters, then the data):
      cost of encoding the parameters = (bits at the optimal precision, ~ (1/2) log n) * (number of parameters)
    . regret of the mixture using Jeffreys prior: also ~ (number of parameters / 2) * log n
// + CTW [Bensadon p 37]
// + ex: music score generation with an RNN?

Note:
- maximizing entropy can be good:
  . KL(p_theta || uniform) = sum_x p_theta(x) log( p_theta(x) |X| ) = log|X| - H(p_theta)
  ==> if you have no information on the law, pick the parameter leading to the highest entropy

------------------------------------------------------------------------------------------
VII - Conclusion of the Information Theory part
-----------------------------------------------
- Information geometry
------------------------------------------------------------------------------------------