Intro:
- last session on Information Theory: Fisher information
- fundamental applications to many domains:
- optimization
- modeling: priors on parameters
- MDL: precision/encoding-cost of a parameter?
- end of the lesson: fast bandits, needed for projects
- Fisher information/metric/matrix
- setting: Model(\(\theta\)) --> \(p_\theta(x)\)
- formula 1 : \(J(\theta) = E_{x \sim p_\theta} [ d^2( - \log p_\theta(x) ) / d \theta^2 ]\)
- average second derivative of the encoding cost, over all possible data under the model law
- formula 2 : \(J(\theta) = E_{x \sim p_\theta} [\; d \log p_\theta(x)/d\theta \;\; d \log p_\theta(x)/d\theta ^T \;]\)
- covariance of the gradient of the cost, under the model law
- obtained by expanding:
- \(d\log p/d\theta = 1/p\; dp/d\theta \)
- \(d^2\log p/d\theta ^2 = -1/p^2\; dp/d\theta \; dp/d\theta + 1/p\; d^2p/d\theta ^2\)
- \(J(\theta) = \int_x p\,[-...]\,dx = E[(d\log p/d\theta)^2] + 0\) as \(\int_x d^2p/d\theta ^2\, dx = d^2/d\theta ^2 \int_x p\, dx = 0\)
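A quick sanity check (my worked example, assuming a Bernoulli(\(\theta\)) model, not spelled out here): the two formulas give the same value.
\[
p_\theta(x)=\theta^x(1-\theta)^{1-x},\qquad
\frac{d\log p_\theta(x)}{d\theta}=\frac{x}{\theta}-\frac{1-x}{1-\theta},\qquad
\frac{d^2(-\log p_\theta(x))}{d\theta^2}=\frac{x}{\theta^2}+\frac{1-x}{(1-\theta)^2}
\]
\[
\text{formula 1: } E\Big[\frac{x}{\theta^2}+\frac{1-x}{(1-\theta)^2}\Big]=\frac{1}{\theta}+\frac{1}{1-\theta}=\frac{1}{\theta(1-\theta)},\qquad
\text{formula 2: } E\Big[\Big(\frac{x}{\theta}-\frac{1-x}{1-\theta}\Big)^{\!2}\Big]=\frac{1}{\theta(1-\theta)}
\]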
Why information "geometry" ?
- norms in the space of probabilities (variations)
examples of observations: \(x_1, x_2\), ...
or \((x_1,y_1), (x_2,y_2)\), ... to learn a function \(y=f(x)\)
\(\implies\;\) finding the right \(\theta\) = learning the function
- second order of \(KL(p_{\theta+\delta\theta }||p_\theta) = \int_x p_{\theta+\delta\theta }(x) \log p_{\theta+\delta\theta }(x) / p_\theta(x) dx\)
- use \(\ln(1+z) = z - 1/2\, z^2 + O(z^3)\)
- use \(d \ln f(z) / dz = \frac{1}{f(z)}\; df/dz\)
- use \(\int_x p_\theta(x)\, dx = 1 \)(for any \(\theta\)) \(\;\;\implies\;\;\) \(\int_x d^k p_\theta(x)/d\theta^k\, dx = 0\)
- From now on, all \(p\) mean \(p_\theta(x)\)
- \(p_{\theta+\delta\theta}(x) = p(x) + \delta\theta\; dp/d\theta + 1/2\; \delta\theta \; d^2p/d\theta ^2\; \delta\theta + O(\delta\theta ^3)\)
- \(p_{\theta+\delta\theta}(x) / p_\theta(x) = 1 + \delta\theta \; 1/p\; dp/d\theta + 1/2\; \delta\theta \; 1/p \; d^2p/d\theta ^2 \; \delta\theta + O(\delta\theta ^3)\)
- \(\log( p_{\theta+\delta\theta}(x) / p_\theta(x) ) = \delta\theta \; 1/p\; dp/d\theta + 1/2\; \delta\theta \; 1/p\; d^2p/d\theta ^2\; \delta\theta - 1/2\; \delta\theta \; 1/p^2\; dp/d\theta \; dp/d\theta ^T\; \delta\theta + O(\delta\theta ^3)\)
- \(KL(p_{\theta+\delta\theta }||p_\theta) = E_{x\sim p_{\theta+\delta\theta }}[ ... ] = \int_x (p + \delta\theta\; dp/d\theta + ...)\,[...]\, dx\)
\(= 0 + 0 - 1/2\; \delta\theta \int_x 1/p\; dp/d\theta \; dp/d\theta ^T dx\; \delta\theta + 1 \times \mathrm{idem}\) (the cross term \(\delta\theta\, dp/d\theta \times \delta\theta\, \tfrac{1}{p}\, dp/d\theta\); all other terms integrate to 0) \(+\; O(\delta\theta ^3)\)
\(= 1/2 \;\delta\theta\, E_{x\sim p}[\, d\log p/d\theta \;\, d\log p/d\theta ^T \,]\, \delta\theta + O(\delta\theta ^3) \;=\; 1/2\; \delta\theta\; J(\theta)\; \delta\theta + O(\delta\theta ^3)\)
--> metric associated to KL
--> curvature of relative entropy
// + differential entropy <--> Fisher information [Cover&Thomas p 672]
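A numeric sanity check of this second-order expansion (my sketch, assuming a Bernoulli(\(\theta\)) model, for which \(J(\theta) = 1/(\theta(1-\theta))\)):
```python
# Sanity check (not from the lecture): for Bernoulli(theta),
# KL(p_{theta+dtheta} || p_theta) ~= 1/2 * dtheta^2 * J(theta).
import numpy as np

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, dtheta = 0.3, 1e-3
J = 1.0 / (theta * (1 - theta))             # Fisher information of Bernoulli(theta)
print(kl_bernoulli(theta + dtheta, theta))  # exact KL            ~ 2.38e-6
print(0.5 * dtheta**2 * J)                  # 1/2 dtheta^2 J      ~ 2.38e-6
```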
- In practice:
- Fisher: \(E_{x\sim p_\theta} [...] \approx 1/n \sum_i [...]\) over samples \(x_i\)
\(\implies\;\) = empirical covariance of the gradients, or average Hessian
- the sum makes sense because all the terms live in the same space (the tangent space of the parameter space at the point \(\theta\))
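A minimal sketch of this empirical estimate (my example, a Gaussian \(N(\mu,\sigma^2)\) with \(\theta = \mu\), for which the exact value is \(J(\mu) = 1/\sigma^2\)):
```python
# Empirical Fisher information as the covariance of the score,
# estimated by 1/n sum_i over samples (my sketch, not from the lecture).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=n)

score = (x - mu) / sigma**2        # d/dmu log p_mu(x_i) for each sample
J_hat = np.mean(score**2)          # empirical covariance of the score (E[score] = 0)
print(J_hat, 1 / sigma**2)         # ~0.25 vs exact 0.25
```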
- link with the Cramér-Rao bound, which can be attained (theorem, Amari) [Cover&Thomas: chapter 11.10, p 418 (392): estimator bias, Cramér-Rao]
- estimator: given observations \(x_i\), guess the parameter \(\theta\) such that \(p_\theta\) fits them best
- estimator unbiased if \(E_{X\sim p_\theta}[ \theta_{estimated}(X) ] = \theta\)
- question: the mean of the estimator is right, but what about its variance? is \(|\theta_{estimated} - \theta|\) typically large?
- Theorem: Cramér-Rao inequality:
variance(any unbiased estimator) \(\geq 1/J(\theta)\) in dim 1
covariance(any unbiased estimator) \(\succeq J(\theta)^{-1}\) (as matrices) in higher dimensions
- Proof in dim 1
- unbiased estimator T
- \(p = p_\theta\)
- \(V = d\log p/d\theta\)
- Cauchy-Schwarz : \(E_{x\sim p}[ V T ]^2 = \mathrm{Cov}(V,T)^2 \leqslant \mathrm{var}(V)\, \mathrm{var}(T) = J(\theta)\; var(T)\) (using \(E[V] = \int_x dp/d\theta\, dx = 0\) and \(E[V^2] = J(\theta)\))
- \(E[VT] = 1\):
- \(E[VT] = \int_x p \; d\log p/d\theta\; T = \int_x dp/d\theta\; T = d/d\theta \int_x p\; T = d/d\theta \; \theta = 1\) (since \(T\) is unbiased)
- hence \(1 \leqslant J(\theta)\, var(T)\), i.e. \(var(T) \geqslant 1/J(\theta)\)
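A Monte Carlo illustration (my sketch): for \(n\) i.i.d. draws from \(N(\mu,\sigma^2)\) with known \(\sigma\), the sample mean is unbiased and its variance equals the Cramér-Rao bound \(1/(n J(\mu)) = \sigma^2/n\), so the bound is attained here.
```python
# Cramér-Rao check (my sketch): variance of the sample mean vs sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 1.0, 2.0, 50, 20_000
estimates = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)  # unbiased estimator of mu
print(estimates.var())   # ~0.08
print(sigma**2 / n)      # Cramér-Rao bound: 4 / 50 = 0.08
```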
II - Natural gradient [Bensadon MDL talk 6]
- desired properties
- invariance to reparameterization
- importance of metrics
- norm of a variation \(d\theta\)?
- dependence on the parameter representation
\(\implies\;\) natural gradient: \(\delta\theta \propto J(\theta)^{-1}\, \nabla_\theta(\mathrm{loss})\) (sketch below)
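A minimal sketch of one natural-gradient step (my illustration, for a Bernoulli(\(\theta\)) model fit by minimizing the average negative log-likelihood; here \(J(\theta) = 1/(\theta(1-\theta))\), a unit-step natural gradient jumps straight to the maximum-likelihood estimate, while the plain gradient step depends badly on the parameterization):
```python
# Natural gradient step: theta <- theta - eta * J(theta)^{-1} * grad (my sketch).
import numpy as np

data = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p_hat = data.mean()                              # maximum-likelihood estimate

theta, eta = 0.9, 1.0
grad = (theta - p_hat) / (theta * (1 - theta))   # d/dtheta of the average NLL
fisher = 1.0 / (theta * (1 - theta))             # J(theta) for Bernoulli
theta_natural = theta - eta * grad / fisher      # natural step: lands on p_hat
theta_vanilla = theta - eta * grad               # plain step: overshoots badly here
print(theta_natural, theta_vanilla, p_hat)
```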
- Approximations and related
- \(\mathrm{Hessian}^{-1}\)
- diagonalized
- Kalman
- EM (expectation-maximization) = natural gradient step
- Newton
III - Universal coding
[Bensadon p 30]
- model with parameters
- data arrives --> 4 ways to update models / encode parameters :
1. Explicit encoding of parameters
- read all data, estimate parameters
- encode parameters
- re-read all the data and encode it given the model with those parameters --> problem: how to encode a real-valued parameter? an infinite number of bits?
- \(k\) first binary digits of \(\theta\): \(K(\theta) = k\), precision on \(\theta = 2^{-k}\)
- \(K(\mu) - \log( \mu(x))\) : complexity vs accuracy \(\implies\;\) we'll see the optimal choice of k in next section (using Fisher information)
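A brute-force illustration of this complexity/accuracy trade-off in scheme 1 (my sketch, for a Bernoulli model on a short binary sequence; the hypothetical truncation scheme sends \(k\) binary digits of \(\theta\), i.e. \(k\) bits, then the data at cost \(-\log_2 p_{\theta_k}(x)\)):
```python
# Two-part code: total cost = k bits for theta at precision 2^-k
#                            + data encoded with the truncated theta (my sketch).
import numpy as np

data = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
n, ones = len(data), int(data.sum())
theta_mle = ones / n

def data_bits(theta):
    """Code length of the data, in bits, under Bernoulli(theta)."""
    return -(ones * np.log2(theta) + (n - ones) * np.log2(1 - theta))

for k in range(1, 11):
    theta_k = (np.floor(theta_mle * 2**k) + 0.5) / 2**k   # theta rounded to k binary digits
    print(k, k + data_bits(theta_k))   # small k: cheap theta, costly data; large k: the reverse
```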
2. Parameters update, no encoding
- start from canonical parameters
- encode a bit of data
- update the parameters accordingly
- iterate
In 2:
- no need to encode the parameters! gain: no encoded/hard-coded parameter
- cost: the data is encoded with the wrong parameters at the beginning, so its length is higher until the parameters converge
3. Normalized Maximum Likelihood [Bensadon p 34]
- issue with 1:
- given \(x\), pick \(\theta_1\) and encode \(x\) with \(p_{\theta_1}\) (encoding \(\theta_1\) first) or pick \(\theta_2\) and encode \(x\) with \(p_{\theta_2}\) (encoding \(\theta_2\) first)
\(\implies\;\) these 2 codes are different but encode the same \(x\)
\(\implies\;\) redundancy \(\implies\;\) not optimal code
- solution: for a given \(x\), keep only the best \(\theta\): set \(p_\theta (x) = 0\) for every \(\theta\) with \(p_\theta (x) < p_{\theta(x)} (x)\), where \(\theta(x) = argmax_\theta\; p_\theta (x)\) denotes the best \(\theta\) for \(x\)
- and renormalize: \(NML(x) = p_{\theta(x)} (x) / \sum_z p_{\theta(z)} (z)\)
- issues: many, for instance \(NML(x_1,x_2) \neq NML(x_1)\; NML(x_2|x_1)\) (NML does not define a consistent sequential distribution)
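A small exact computation of NML (my sketch, for Bernoulli sequences of length \(n=4\); here \(\theta(z)\) is the fraction of ones in \(z\)):
```python
# NML(x) = p_{theta(x)}(x) / sum_z p_{theta(z)}(z) over all binary sequences of length n.
from itertools import product
from math import log2

def ml_prob(z):
    """Likelihood of z under its own maximum-likelihood parameter theta(z)."""
    n, k = len(z), sum(z)
    th = k / n
    return th**k * (1 - th)**(n - k)   # Python convention 0**0 == 1 handles k = 0 and k = n

n = 4
normalizer = sum(ml_prob(z) for z in product((0, 1), repeat=n))
for x in [(0, 0, 0, 0), (0, 1, 0, 1)]:
    nml = ml_prob(x) / normalizer
    print(x, nml, -log2(nml))          # NML probability and codeword length in bits
```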
4. Choose a prior \(q\) over parameters
- and integrate over it: \(p_{model}(x) = \int_\theta q(\theta)\; p_\theta(x)\, d\theta\)
- now independent of \(\theta\)
- as if replacing \(\sum_\mu 2^{-K(\mu)} [...]\) by \(\int_\theta q(\theta) [...]\, d\theta\) to encode \(x\)
- Note: explicitly choosing a single parameter is less efficient: [Bensadon p 34]
- \(\max( p(\theta_1)\; p_{\theta_1}(x) ,\; p(\theta_2)\; p_{\theta_2}(x) )\;\;\) vs \(\;\;\sum_\theta p(\theta) \; p_\theta(x)\)
- \(\implies\;\) the max is at most the sum, so \(-\log(\max) \geqslant -\log(\mathrm{sum})\) \(\implies\;\) longer codewords!
Cover&Thomas p 433-434
Example with Bernoulli(\(\theta\)): binary sequence 001111000010100100
- code with \(\theta\) known: cost = entropy \(\approx n\, H(\theta)\) bits
- count the number of 1s, encode it, then encode the index of the sequence among all sequences with that number of 1s = about twice as short
- uniform prior on \(\theta\) \(\implies\;\) new \(proba(x_1, ..., x_n) = \int_\theta p_\theta(x_1, ..., x_n) d\theta\) \(\implies\;\) idem
- Laplace estimate of \(\theta\) on the fly: \(p(x_{i+1}=1 \mid x_{\leqslant i}) = \frac{\mathrm{number\; of\; 1s\; so\; far} + 1}{i + 2}\)
- from another prior on \(\theta\): Dirichlet\((1/2,1/2)\) (= Beta\((1/2,1/2)\), the Jeffreys prior): \(p(\theta) = 1/ (\pi \sqrt{ \theta (1-\theta) })\) \(\implies\;\) as if \(p(x_{i+1}=1 \mid x_{\leq i}) = (\)number of 1s so far \(+ 1/2)/(i +1)\) (Krichevsky-Trofimov estimator)
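A sketch comparing these codes on the example sequence above (my code; the sequential rules need no transmitted parameter, at the price of a few extra bits over \(n H(\hat\theta)\)):
```python
# Code lengths (bits) for 001111000010100100 under: the empirical theta,
# the on-the-fly Laplace rule, and the Dirichlet(1/2,1/2) / KT rule (my sketch).
import numpy as np

x = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
n, k = len(x), sum(x)
p_hat = k / n

cost_known = -(k * np.log2(p_hat) + (n - k) * np.log2(1 - p_hat))  # = n * H(p_hat)

def sequential_bits(x, a):
    """Bits used when p(x_{i+1} = 1 | past) = (ones so far + a) / (i + 2a)."""
    bits, ones = 0.0, 0
    for i, xi in enumerate(x):
        p1 = (ones + a) / (i + 2 * a)
        bits += -np.log2(p1 if xi == 1 else 1 - p1)
        ones += xi
    return bits

print(cost_known)               # ~17.4 bits, but requires knowing theta
print(sequential_bits(x, 1))    # Laplace (uniform prior):        ~19.2 bits
print(sequential_bits(x, 0.5))  # Krichevsky-Trofimov (Jeffreys): ~19.8 bits
```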
IV - Parameter precision
[Bensadon p 30]
V - Prior by default
VI - Examples / Miscellaneous
- justification of BIC? [Bensadon p 32]
- BIC (Bayesian Information Criterion) [Schwarz, 1978]: \(K(\mu) = 1/2 \times\) (number of parameters) \(\times \log\)(number of observations)
- These 2 approaches lead to the same encoding cost (sketch below):
- two-part code prior (encode the parameters, then the data): cost of encoding the parameters = (bits per parameter at the optimal precision) \(\times\) (number of parameters)
- regret of the Bayesian mixture using the Jeffreys prior
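A hedged sketch of the standard argument (using the optimal parameter precision from the parameter-precision discussion): with \(n\) observations, the best precision per parameter is of order \(1/\sqrt{n}\), so each of the \(d\) parameters costs about
\[
-\log_2 \frac{1}{\sqrt{n}} = \tfrac12 \log_2 n \ \text{bits}
\quad\Longrightarrow\quad
K(\mu) \approx \frac{d}{2}\, \log_2 n,
\]
which is the BIC penalty above; the regret of the Jeffreys-prior mixture has the same \(\frac{d}{2}\log n\) leading term.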
// + CTW [Bensadon p 37] // + ex: music score generation with an RNN?
Note: maximizing entropy can be good:
- \(KL(p_\theta||uniform) = \sum_x p_\theta(x) \log ( p_\theta(x)\, |X|) = \log|X| - H(p_\theta)\)
- \(\implies\;\) if you have no information on the law, pick the parameter leading to the highest entropy