------------------------------------------------------------------------------------------
Chapter 3 : Kolmogorov complexity
------------------------------------------------------------------------------------------

Introduction
. sequence completion : how?
. sequence probability? preference?
. sequence model?

Find a model for the following sequences:
. 0101010101010101010101010101010101010101010101010101010101010101
  : repeat 01
. 0110101000001001111001100110011111110011101111001100100100001000
  : sqrt(2)-1 written in binary!
. 1101111001110101111101101111101110101101111000101110010100111011
  : too many 1s for a uniform law
  --> encodable in log n + n H(k/n) bits [encode n and k first, to describe k/n],
      or give its index in the set of all length-n sequences with k ones
      (see the method in the proof of the entropic bound later)
. english text + same text in german :
  if short, just compress both independently;
  if very long, encode how to translate english to german, and use that knowledge when decoding

------------------------------------------------------------------------------------------
I - Kolmogorov complexity
-------------------------

- Definition [Cover&Thomas p 463]
  . K(s) = description cost of the string s = the length of the shortest program that produces it
  . unit: bits
  . simple examples : 000000..., the first digits of pi
  . associated probability : 2^-K(s) => a distribution over strings
    . does it sum to 1? no, to Omega (Chaitin's constant), which is provably non-computable
    . halting problem

- Universality [Cover&Thomas p 427]
  . more or less Turing-machine-invariant
  . universal Turing machine
  . Church's thesis : all computational models are equivalent (if sufficiently rich)
  . K_{computer 1}(s) <= K_{computer 2}(s) + C_{1,2} for all strings s,
    by running a simulator of computer 2 (of size C_{1,2} bits) on computer 1,
    and feeding it the program achieving K_{computer 2}(s)
  . consequence: we'll always have a "+ constant" in all formulas bounding K(s)
  . not dependent on the encoding of the data:
    given two possible binary encodings f and g for data x (which is not binary itself),
    K( g(x) ) <= K( f(x) ) + K( g o f^-1 ) + c
  . these constants are small (< 1 MB) compared to possibly big data sizes (GB, TB)
    ==> Kolmogorov complexity really makes sense

- Extensions (defined up to a multiplicative constant, i.e. as exp(something +/- c))
  . probability from Kolmogorov: p(s) = 2^-K(s)
  . version summing over programs: sum_{all programs p that produce s} 2^-|p|
  . version with probabilistic programs: sum_{all random programs p} 2^-|p| P(p outputs s)
  . version with distributions: sum_{all probability distributions mu} 2^-|mu| mu(s)
  --> Proposition: these 4 distributions are equivalent
      (i.e. for all i,j there exists C such that P_i <= C P_j) [A.K. Zvonkin and L.A. Levin, 1970]
  ==> named the Solomonoff universal prior (used for prediction)
  . relative complexity K(s|z), when z is already available (no need to describe it):
    a concept similar to mutual information, conditional entropy, etc.

------------------------------------------------------------------------------------------
II - Bounds
-----------

- Easy upper bounds
  . K(s) <= length(s) + 2 log length(s) + c :
    program "print the following chain, of length [length(s)]: [s]",
    or length(s) + log + log log + log log log + ... + c (cf previous lesson)
  . K(s) <= |s| + K(|s|) + c (more general than the above) [Note: "|s|" denotes "length(s)"]
  . can't compress everything (it doesn't fit):
    the number of strings s with K(s) <= n is at most 1 + 2 + 2^2 + 2^3 + ... + 2^n < 2^{n+1},
    because their programs (all distinct) have to be written in at most n bits
    (so: either 0 bits, or 1 bit, or 2, ..., or n)
    ==> not many strings are simple
  . strings s s.t. K(s | |s|) >= |s| are called algorithmically "random" by Kolmogorov
    (because they have no regularity)
    [Note: Kolmogorov is also the mathematician who formalized probabilities
     and randomness the way we use them nowadays]
  . infinite binary strings such that lim_{n-->infty} K(x1...xn|n)/n = 1 are called incompressible
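The counting argument above can be checked empirically: almost no random string is compressible, while regular strings are. A minimal sketch, using zlib's output length as a (crude) computable upper bound on K(s) (everything here is a toy illustration, not part of the course material):

```python
import os
import zlib

# Counting argument: fewer than a 2^{-k+1} fraction of n-bit strings
# satisfy K(s) <= n - k, so random data should essentially never shrink.
n_bytes = 1000
trials = 100
shrunk = 0
for _ in range(trials):
    s = os.urandom(n_bytes)                 # a "typical", incompressible string
    if len(zlib.compress(s, 9)) < n_bytes:  # did zlib save even one byte?
        shrunk += 1
print("random strings compressed by zlib:", shrunk, "/", trials)

# A very regular string, by contrast, compresses massively:
print("K upper bound for '01'*4000 (8000 bytes):",
      len(zlib.compress(b"01" * 4000, 9)), "bytes")
```

In practice zlib even adds a few bytes of overhead on incompressible input, so the first count stays at 0.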
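Back to the introduction: the second example sequence (sqrt(2)-1 in binary) really is produced by a very short program. A sketch using exact integer arithmetic (`math.isqrt`); the helper name is ours:

```python
from math import isqrt

def sqrt2_minus_1_bits(n: int) -> str:
    """First n binary digits of the fractional part of sqrt(2), i.e. of sqrt(2)-1.

    floor(sqrt(2) * 2^n) is computed exactly with integers:
    isqrt(2 * 4^n) = floor(sqrt(2) * 2^n).
    """
    frac = isqrt(2 << (2 * n)) - (1 << n)  # drop the leading integer bit of sqrt(2)
    return format(frac, f"0{n}b")

print(sqrt2_minus_1_bits(64))
# -> 0110101000001001111001100110011111110011101111001100100100001000
```

The 64 output bits match the sequence given in the introduction, so its Kolmogorov complexity is bounded by the length of this short program (plus a constant), independently of n.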
  . K(s) <= |zip(s)| + |unzip program|
    ==> a distance based on zip was used to cluster files (texts, MIDI files...)
        and it worked! (clusters by author)
    . d(x,y) = max( K(x|y), K(y|x) ) / max( K(x), K(y) )
    . Rudi Cilibrasi and Paul Vitanyi. Clustering by compression.
  . if x in E (finite set), K(x) <= K(E) + ceil( log |E| ) + c
  . generalization: given a set X and a probability distribution mu on it,
    K(x) <= K(mu) - log mu(x) + c
    ==> in machine learning: a model mu is good if this quantity is small!!!

- Kolmogorov complexity is non-computable!!!
  . see Gödel, Turing, the halting problem: undecidability of the output of some programs
    (whether they halt, and if yes, what they output)
  . one cannot prove that K(x) > 1 MB, whatever x is
  . Berry paradox : "The smallest number that cannot be described in less than 13 words",
    which we will update into (see below):
    "The [first x found] that cannot be described in less than [L] bits"
  . paradoxical proof in two steps, by Chaitin (Chaitin's incompleteness theorem, 1971):
    Step 1: Proposition: there exists a constant L s.t. it is not possible to prove
            the statement "K(x) > L" for any x.
      Proof: pick L = 1 MB. Write a program which goes through all possible proofs
      and stops when it finds a proof of "K(x) > L" (whatever x is), printing that x.
      This program has length < L. If it stops and prints an x, then that x can be
      described by this very program... so K(x) < L. Contradiction!
      So this program never stops, and consequently there exists no proof
      of the form "K(x) > 1 MB" for any x.
    Step 2: Theorem: Kolmogorov complexity is not computable.
      Proof: consider all integers between 1 and 2^{L+1}.
      There are at most 2^{L+1}-1 programs of length <= L (we proved it earlier),
      so at least one of these integers n0 has K(n0) > L.
      If there existed a program able to compute the Kolmogorov complexity of any integer,
      then by computing K(n0) one would prove K(n0) > L: contradiction!
      So such a program doesn't exist.
==> it is not possible to have lower bounds on the Kolmogorov complexity
    of a given [sufficiently complex] string (i.e. provided it's over 1 MB)

- Entropic bound [Cover&Thomas p 473]
  . one cannot prove a lower bound on K for a given s,
    BUT one can prove lower bounds *on average*
  . consider a probability distribution f over a finite set X
  . H(X) <= E_{x~f^n}[ 1/n K(x) ] <= H(X) + (|X|+1) log n / n + c/n
  . ==> lim_{n-->infty} E_{x~f^n}[ 1/n K(x) ] = H(X) !!!!
    cannot do better than the entropy (on average)
  . Proof:
    . lower bound: K(s) is the length of an encoding of s
      (the code of s is the shortest program describing it)
      ==> Gibbs' inequality ==> cannot be better than the entropy (KL >= 0)
    . upper bound: by explicit construction
      . encode n: costs 2 log n
      . count the occurrences of each symbol of the alphabet X in s: n_a, n_b, n_c, ..., n_z
      . encode these counts: costs |X| log n
        (actually (|X|-1) log n, as the last count is known: n_z = n - (n_a + n_b + ...))
      . now, consider the list of all possible sequences of n characters
        which have exactly n_a 'a', n_b 'b', etc.
      . this list has size <= 2^{n H(n_a/n, n_b/n, ...)}
        --> indeed, in the binary case, (n choose k) <= 2^{n H(k/n)}:
            use Stirling's formula, or use sum_k (n choose k) p^k (1-p)^{n-k} = 1 with p = k/n,
            so that (n choose k) p^k (1-p)^{n-k} <= 1 (cf details in Cover&Thomas)
      . so encoding our sequence by its index in this list costs <= n H(n_a/n, n_b/n, ...) bits,
        whose expectation is <= n H(X) by concavity of the entropy (Jensen)
      . total encoding cost: what is written in the formula above

------------------------------------------------------------------------------------------
III - MDL (Minimum Description Length)
--------------------------------------

- Definition
  . general criterion for model selection:
    given a set X and a probability distribution mu on it, K(x) <= K(mu) - log mu(x) + c
    --> K(mu) : complexity of the model
    --> - log mu(x) : likelihood (how well the model fits the data)
  . natural trade-off between model complexity and accuracy : cf Occam's razor
  . deals naturally with overfitting: the model itself has to be encoded as well
  . examples:
    . Dirac: K(mu) = K(x); - log mu(x) = 0
    . Gaussian distribution N(m, sigma) fitted to a cloud of points (x_i):
      - log mu(x) = sum_i |x_i - m|^2 / (2 sigma^2) + c
      ==> choose m = mean of the x_i, by solving the least-squares problem (if neglecting K(m))

- Instantiations
  . AIC, BIC : approximation models for K(model)
    . AIC : K(mu) = number of parameters of mu
      Akaike Information Criterion [Akaike, 1973]
    . BIC : K(mu) = 1/2 * (number of parameters) * log(number of observations)
      Bayesian Information Criterion [Schwarz, 1978]
    ==> justification next lesson (Fisher information)
  . examples?
  . NB: AIC/BIC are not good approximations of K(mu) for neural networks
    (millions of parameters). For a better approach in that case, check
    [The Description Length of Deep Learning Models; Léonard Blier, Yann Ollivier;
     NeurIPS 2018, https://arxiv.org/abs/1802.07044]

- Restricted families of programs [Ollivier/Bensadon p 25]
  . K(s) = |zip(s)|
  . combine naive generative models into a more complex one (cf the reinforcement learning course)
    --> Markov models
  . auto-encoder: the middle (smallest) layer is the compressed data
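The AIC/BIC instantiation above can be illustrated on the Gaussian fitting example: choose the polynomial degree that minimizes -log-likelihood + (1/2) * #params * log(#observations). A minimal sketch with made-up data (the generating model, seed, and candidate degrees are our toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, n)  # true model: degree 2

def bic(degree: int) -> float:
    # -log-likelihood of the least-squares Gaussian fit (up to model-independent
    # constants), plus the BIC penalty (1/2) * #params * log(#observations)
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid**2)
    neg_log_lik = 0.5 * n * np.log(sigma2)
    k = degree + 2  # polynomial coefficients + sigma
    return neg_log_lik + 0.5 * k * np.log(n)

best = min(range(1, 6), key=bic)
print("degree selected by BIC:", best)  # should recover the generating degree (2)
```

Higher degrees always fit the data at least as well (smaller -log-likelihood), but the K(model) penalty rejects them: exactly the MDL trade-off.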
  . programs on Turing machines --> restrict to finite automata:
    then Kolmogorov complexity becomes computable

------------------------------------------------------------------------------------------
IV - Conclusion
---------------

MDL = a very general principle, able to formulate any machine learning problem

Next chapter:
- end of information theory (Fisher information)

------------------------------------------------------------------------------------------
References
----------

Bensadon's thesis (Yann Ollivier's presentations) : Part I
Cover & Thomas : Chapter 14 mainly, + Chapter 13 + 11.3

------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Practical session : finishing the practical of chapter 2
------------------------------------------------------------------------------------------

- don't forget to test all learned models by generating new text
- Markov chains (formulas for entropy/cross-entropy, etc.)
- note that the entropy decreases when the order of the model increases
  (exercise: prove it mathematically)
- how to deal with symbols not seen yet: consider a new probability
  (1-epsilon) p_model + epsilon p_uniform
- how to choose epsilon above: by cross-validation!
- draw on the same chart all the entropies (cross-entropies), for all models
- cluster the texts using Kullback-Leibler divergence (= relative cross-entropy),
  with classes being languages, authors, styles, topics...
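The smoothing and cross-validation steps of the practical can be sketched as follows, on a character unigram model (the toy strings, alphabet, and helper names are ours; a real run would use the chapter-2 corpora and higher-order Markov models):

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def cross_entropy(text: str, probs: dict) -> float:
    # average number of bits per character of `text` under the model `probs`
    return -sum(math.log2(probs[c]) for c in text) / len(text)

def smoothed_model(train: str, eps: float) -> dict:
    counts = Counter(train)
    n = len(train)
    u = 1.0 / len(ALPHABET)
    # (1 - eps) * empirical frequency + eps * uniform: never assigns probability 0,
    # so symbols unseen in training no longer give an infinite cross-entropy
    return {c: (1 - eps) * counts[c] / n + eps * u for c in ALPHABET}

train = "the theme of the thesis is the theory of information " * 5
valid = "entropy and complexity"  # contains letters absent from train (p, c, l, x)

# choose epsilon on held-out data (cross-validation)
candidates = [0.01, 0.05, 0.1, 0.2, 0.5, 0.9]
best_eps = min(candidates, key=lambda e: cross_entropy(valid, smoothed_model(train, e)))
print("best epsilon:", best_eps)
```

With eps = 0 the validation cross-entropy would be infinite (unseen symbols get probability 0), while eps = 1 ignores the training data entirely; cross-validation picks the trade-off in between.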