------------------------------------------------------------------------------------------
Chapter 3 : Kolmogorov complexity
------------------------------------------------------------------------------------------
Introduction
. sequence completion : how?
. sequence probability? preference?
. sequence model?
Find a model for the following sequences:
. 0101010101010101010101010101010101010101010101010101010101010101 : repeat 01
. 0110101000001001111001100110011111110011101111001100100100001000 : srqt(2)-1 in binary writing!
. 1101111001110101111101101111101110101101111000101110010100111011 : too many 1s for uniform law --> encodeable in log n + n H(k/n) [encode n,k first to describe k/n] or give its number in the set of all sequences with k 1 (see the method in the proof of the entropic bound later).
. english text + same text in german : if short, just compress both independently
if very long, encode how to translate english to german, and use that knowledge when decoding
------------------------------------------------------------------------------------------
I - Kolmogorov complexity
-------------------------
- Definition [Cover&Thomas p 463]
. K(string s) = description cost of s = the length of the shortest program that can produce it
. unit: bits
. simple examples : 000000, first digits of pi
. associated proba : 2^-K(s)
=> distribution over strings
. sums to 1? no, to Omega (Chaitin's constant) which is provably non-computable
. halting problem
- Universality [Cover&Thomas p 427]
. more or less Turing-machine-invariant
. universal Turing machine
. Church's thesis : all computational models are equivalent (if sufficiently complex)
. K_{computer 1}(s) <= K_{computer 2}(s) + C_{1,2} for all strings s
by running a simulator of Computer 2 (of size with C_{1,2} bits) on Computer 1, and executing the program associated to K_{computer 2}(s).
. we'll always have +constant in all formulas bounding K(s)
. not dependent on encoding
. given two possible binary encodings f and g for data x (which is not binary)
K( g(x) ) <= K( f(x) ) + K( g o f^-1 ) (+c)
. these constants are small (<1MB) compared to possible big data size (GB, TB)
==> Kolmogorov complexity really makes sense
- Extensions (defined up to a multiplicative constant, as exp(something +/- c) )
. proba from Kolmogorov: p(s) = 2^-K(s)
. version sum_{over all programs p that produce s} 2^-length(p)
. version with probabilistic programs: sum_{all random programs p} 2^-|p| P(p outputs s)
. with distributions: sum_{all probability distributions mu} 2^-|mu| mu(x)
--> Proposition: these 4 distributions are equivalent (i.e. for all i,j, exists C, P_i <= C P_j) [A.K. Zvonkin and L.A. Levin, 1970]
==> named Solomonoff universal prior (for prediction)
. relative complexity K(s|z) when z is already available (no need to describe it): concept similar to mutual information, conditional entropy, etc.
------------------------------------------------------------------------------------------
II - Bounds
-----------
- Easy upper bounds
. K(s) <= length(s) + 2 log length(s) + c : program "print the following chain, of length [length(s)]: [s]"
or + log + log log + log log log + ... + c (cf previous lesson)
. K(s) <= |s| + K(|s|) +c (more general than above) [Note: "|s|" denotes "length(s)"]
. can't compress everything (doesn't fit):
number of strings s with K(s) <= n is 1 + 2 + 2^2 + 2^3 + 2^4 +... + 2^n < 2^{n+1}
because their programs (all different) have to be written in less than n bits (so: either 0 bit, either 1 bit, either 2... either n)
==> not many strings are simple
. strings s s.t. K(s | |s|) > |s| are named algorithmically "random" by Kolmogorov (because no regularity) [Note: Kolmogorov is also the mathematician who formalized probabilities and randomness the way we use them nowadays]
. infinite binary strings such that lim_{n-->infty} K(x1...xn|n)/n = 1 are called incompressible
. K(s) <= |zip(s)| + |unzip program|
==> distance based on zip used to cluster files (text, MIDI files...) and it worked! (clusters by authors)
. d(x,y) = max( K(x|y), K(y|x) ) / max( K(x), K(y) )
. Rudi Cilibrasi and Paul Vitanyi. Clustering by compression.
. If x in E (finite), K(x) <= K(E) + |^ log |E| ^| +c
. Generalization: given a set X and a proba mu on it,
K(x) <= K(mu) - log mu(x) +c
==> in machine learning: a model mu is good if this quantity is small !!!
- Kolmogorov complexity is non computable!!!
. see Gödel, Turing, halt problem: indecidability of the output of some programs (whether they halt, and if yes, what they output)
. cannot prove that K(x) > 1 MB, whatever x is
. Berry paradox :
"The smallest number that cannot be described in less than 13 words"
that we will update into (see below):
"The [first x found] that cannot be described in less than [L] bits".
. paradoxal proof in two steps, by Chaitin : (Chaitin's incompleteness theorem, 1971)
Step 1: Proposition:
There exists a constant L s.t. it is not possible to prove the statement K(x) > L for any x.
Proof: pick L = 1 MB. Write a program which goes through all possible proofs and stops when finding a proof of K(x)>L (whatever x is), and prints that x. This program has length < L. If it stops and prints an x, that x can be described with this program... and K(x) < L. Contradiction! So this program never stops. And consequently there doesn't exist any proof of the form "K(x) > 1 MB" for any x.
Step 2: Theorem:
. Kolmogorov complexity is not computable
Proof: consider all integers between 1 and 2^{L+1}. There are only at most 2^{L+1}-1 programs of length <= L (we proved it earlier), so at least one of these integers n0 has K(n0) > L. If there existed a program able to compute the Kolmogorov complexity of any integer, by computing K(n0) one would prove K(n0) > L, contradiction! So such a program doesn't exist.
==> not possible to have lower bounds on the Kolmogorov complexity of a [sufficiently complex] string (i.e. provided it's over 1MB).
- Entropic bound [Cover&Thomas p 473]
. cannot prove lower bound Kolmogorov for a given s BUT can prove lower bounds *on average*
. consider a proba distrib f over a set X
. H(X) <= E_{x~f^n}[ 1/n K(x) ] <= H(X) + (|X|+1) log n / n +c/n
. ==> lim_{n-->infty} E_{x~f^n}[ 1/n K(x) ] = H(X) !!!! cannot do better than entropy (on average)
. Proof:
. lower-bound: K(s) is the length of an encoding (the code of s is the shortest program describing it) ==> Gibbs inequality ==> cannot be better than entropy (KL >= 0)
. upper-bound: by explicit construction
. encode n: costs 2 log n
. count the number of each symbol of the alphabet in s : n_a, n_b, n_c... n_z
. encode them: costs |X| log n (X = the alphabet) (actually: |X|-1, as n_z is known: n_z = n - (n_a + n_b + ...))
. now: draw the list of all possible sequences of n characters which have exactly n_a 'a', n_b 'b', etc.
. this list has size <= 2^{n H( Bernouilli(n_a/n, n_b/n ...) )}
--> because indeed, C^n_k = (n k) <= 2^{ n H( Bernouilli(k/n) )} using Stirling formula, or using sum_k (n k) p^k (1-p)^(n-k) = 1, with p = k/n and considering (n k) p^k (1-p)^(n-k) <= 1 (cf details in Cover&Thomas)
. so encoding our sequence within this list has complexity <= n H(X) using Jensen
. total encoding cost: = what is written in the formula
------------------------------------------------------------------------------------------
III - MDL (Minimum Length Description)
--------------------------------------
- Definition
. general criterion for model selection
given a set X and a proba mu on it, K(x) <= K(mu) - log mu(x) +c
--> K(mu) : complexity of the model
--> - log mu(x) : likelihood (how well the model fits the data)
. natural trade-off between model complexity and accuracy : cf Occam's razor
. deals naturally with overfit: encode the model as well
. examples:
. Dirac: K(mu) = K(x); - log mu(x) = 0
. Gaussian distrib N(m,sigma) fitted to a cloud of points (x_i): - log mu(x) = sum_i |x_i - m|^2 +c
==> choose m = mean of x_i by solving the least square error problem (if neglecting K(m))
- Instantiations
. AIC, BIC : approximation models for K(model)
. AIC : K(mu) = number of parameters of mu
Akaike Information Criterion (AIC) [1973].
. BIC : K(mu) = 1/2 number of parameters * log(number of observations)
Bayesian Information Criterion (BIC) [Schwartz, 1978].
==> justification next lesson (Fisher information)
. examples?
. NB: AIC/BIC are not good approximations of K(mu) for neural networks (millions of parameters). For a better approach in that case, check [The Description Length of Deep Learning models; Léonard Blier, Yann Ollivier; NeurIPS 2018, https://arxiv.org/abs/1802.07044]
- Restricted families of programs [Ollivier/Bensadon p 25]
. K(s) = |zip(s)|
. combine naive generative models into a more complex one (cf course reinforcement learning)
--> Markov models
. auto-encoder: the middle (smallest) layer is the data compressed
. programs on Turing machines --> restrict to finite automata : then Kolmogorov complexity computable
------------------------------------------------------------------------------------------
IV - Conclusion
---------------
MDL = very general principle, to formulate any machine learning problem
Next chapter:
- end of information theory (Fisher information)
------------------------------------------------------------------------------------------
References
----------
Bensadon's thesis (Yann Ollivier's presentations) : Part I
Cover & Thomas : Chapter 14 mainly, + Chapter 13 + 11.3
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Practical session : finishing practical of chapter 2
------------------------------------------------------------------------------------------
- don't forget to test all models learned by generating new text
- Markov chains (formulas for entropy/cross-entropy, etc.)
- note that entropy decreases when the order of the model increases (exercise: prove it mathematically)
- how to deal with symbols not seen yet : consider a new probability : (1-epsilon) p_model + epsilon p_uniform
- how to choose epsilon above : by cross-validation!
- draw on the same chart all entropies (cross-entropies), considering all models
- cluster the texts using Kullback-Leibler (= relative cross-entropy), with classes being languages, or authors, or styles or topics...