------------------------------------------------------------------------------------------
 Lesson 2 : Compression/Prediction/Generation equivalence
------------------------------------------------------------------------------------------


I - Concepts: compression/prediction/generation
-----------------------------------------------

- lossless compression / prediction / generative models

  . data = sequence of symbols, from a known, predefined alphabet (discrete setting)

  . lossless compression:
    . encoder: data --> binary string
      injective! otherwise not decodable (known as: uniquely decodable code)
    . decoder: binary string --> data
      encoder not surjective ==> not optimal compression (easy to build a better compressor)
    . lossless <==> decoder o encoder = Id (no loss)

  . prediction
    . given a sequence of observed variables (x1,...,xn), predict x_{n+1}
    . i.e. give the probability of each possible label (discrete space)
    . so: predictor = function : data (x1...xn) --> p(x_{n+1}|x1,...,xn)   probability of the next symbol

  . generative model
    . model able to generate new data
    . data D is generated according to (= as often as) its model probability p(D)


------------------------------------------------------------------------------------------

II - From one concept to the other
----------------------------------

- prediction --> generation   [easy; see the sampling sketch at the end of this section]
  . start from x1 (given, or taken from a learned distribution)
  . choose x_{n+1} according to the probability law p(x_{n+1}|x1...xn) used to predict the next symbol
  . iterate and generate data this way
  . p(D) = prod_i p(xi|x1...x_{i-1})

- generation --> prediction   [by sampling, might be slow]
  . given x1...xn, how to estimate the probability distribution of x_{n+1}?
  . generate new data D many, many times, according to the generative model's law
  . check how frequently each (x1...xn, x_{n+1}) is generated: this is p(x1...x_{n+1})
  . compute p(x_{n+1}|x1...xn) = p(x1...x_{n+1}) / p(x1...xn)   (Bayes)

- compression --> prediction   [see the zlib sketch at the end of this section]
  . idea: strings whose encoding is shorter are more probable (coherent with L = - log p, i.e. p = 2^-L)
    ==> to complete a string: L(string+'c') --> p('c') proportional to 2^-L(string+'c')
  . given x1...xn, build all possible completed sequences (x1,...,xn,x_{n+1}) : s_k = (x1,...,xn, k-th symbol)
  . compress each of them independently, and check their encoding lengths L_k = size( encode(s_k) )
  . define p(k) = 2^-L_k
  . normalize p(k) by a global factor if necessary (we will see that the sum is already 1 if the encoder is optimal)
  . define p(x_{n+1}=k|x1...xn) = p(k)

- prediction --> compression   [the main point of this lesson]
  . lesson from entropy: if we have good prior information on the distribution of x_{n+1}|x1...xn,
    only little information needs to be communicated to know which symbol was chosen for x_{n+1}
  . encode just that information --> in such a way that it is decodable later
  . entropy = average length of that code; so, intuitively, it is not possible to do shorter (without losing information)
  . can we prove that bound, and is it possible to reach that limit? how? ==> that's the next sections
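
- sampling sketch (Python) for "prediction --> generation": the predictor below is a made-up toy
  first-order model standing in for whatever model was learned; the sampling loop and the chain-rule
  product p(D) = prod_i p(xi|x1...x_{i-1}) are the point:

    # Sample x_{n+1} from the predictor's distribution p(x_{n+1} | x1..xn) and iterate.
    # The toy predictor is a hypothetical stand-in for any learned model.
    import random

    def predict(prefix):
        """Toy predictor: returns p(next symbol | prefix) as a dict (made-up probabilities)."""
        if prefix and prefix[-1] == 'a':
            return {'a': 0.1, 'b': 0.6, 'c': 0.3}
        return {'a': 0.5, 'b': 0.25, 'c': 0.25}

    def generate(n, seed=None):
        """Generate n symbols by iterating the predictor; also return p(D) = prod_i p(x_i | x_<i)."""
        rng = random.Random(seed)
        data, p_D = [], 1.0
        for _ in range(n):
            probs = predict(data)
            symbols, weights = zip(*probs.items())
            x = rng.choices(symbols, weights=weights)[0]   # sample x_i ~ p(. | x_<i)
            p_D *= probs[x]
            data.append(x)
        return data, p_D

    print(generate(10, seed=0))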
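
- minimal sketch (Python) of "compression --> prediction", using zlib as an off-the-shelf compressor:
  p(next symbol = c) is taken proportional to 2^-L(prefix + c). Compressed lengths come in whole bytes,
  so the estimate is very coarse (often uniform on short prefixes); this only illustrates the principle,
  not a practical predictor:

    # Complete the prefix with each candidate symbol, compress, and set
    # p(c) proportional to 2^(-L(prefix + c)); zlib lengths are in bytes (8 bits each).
    import zlib

    def next_symbol_distribution(prefix, alphabet):
        lengths = {c: len(zlib.compress((prefix + c).encode('utf-8'), 9)) for c in alphabet}
        weights = {c: 2.0 ** (-8 * l) for c, l in lengths.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}   # normalize: zlib is far from optimal

    print(next_symbol_distribution("ab" * 200, "abc"))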

------------------------------------------------------------------------------------------

III - Encoders
--------------

- issue: when to stop reading the data

  . self-delimiting codes:
    . to code an integer n: needs b = log n bits; but one also needs to say that there are b bits to be read
      (if b is not known in advance)
    . ==> need to encode first the number b = log n; this takes log b = log log n bits
    . ==> additional log log n bits at the beginning of the encoded binary string
    . issue: one also needs to tell to read log log n bits...
    . ==> recursively, needs log n + log log n + log log log n + ... + log^k n bits,
      with k such that log^k n has only 1 digit (i.e. is 0 or 1)
    . or just log n + 2 log log n to make things simpler (log log n is already negligible w.r.t. log n),
      by starting with 000000...001 (log log n bits) to tell that there are as many bits (log log n) to read next
      to find log n, which will tell how many bits to read to find n.

  . codes with an "end-of-file" character
    . postfix: read the data until you meet a special code meaning "end"
    . to do this: make binary blocks of fixed size, and use one of them (e.g. 0000) for "end"
    . obvious additional cost: - log p(end) = +4 here
    . hidden cost: the block coding for 'end' cannot be used anywhere else
      ==> cost = number of blocks * - log(1-p(end))
          (factor ~ 0.093 here ==> not negligible when encoding long strings)

  . prefix codes: the code of a string is never the beginning of the code of another string
    (otherwise, confusion when decoding [though it may be resolved later in the reading of the encoded string])
    . examples (starting from page 104 in MacKay's book)
    . we will see later that, without loss of generality, one can assume a code is prefix
      (otherwise, rebuild a code with the same coding lengths)


------------------------------------------------------------------------------------------

IV - Bounds, optimal codes and Kraft inequality
-----------------------------------------------

- entropy ==> make use of symbol probabilities
  . to reduce the average coding length: frequent symbols should have a shorter code
  . hint from the compression/prediction equivalence: 2^-Length(code of s) should be proportional to p(s)
    ==> we expect Length(code of s) = - log p(s)   (which we already knew)

- Kraft inequality
  . given an encoder, if (l_i) is the list of the lengths of all possible codewords (with multiplicity), then:
      the code is uniquely decodable ==> sum_i 2^-l_i <= 1
    and reciprocally, if (l_i) satisfies sum_i 2^-l_i <= 1, then there exists an encoder (prefix code)
    whose codewords have this length distribution
  . if equality, the code is said to be "complete"
  . math proof: write S = sum_i 2^-l_i, consider any integer n, expand S^n, count terms,
    show that it grows at most linearly with n, hence S cannot be > 1 (otherwise S^n would grow exponentially)
    - more details: denoting l_min = min_i l_i and l_max = max_i l_i, regroup the terms in
      S^n = sum_{i1,i2,...,in} 2^-(l_i1 + l_i2 + ... + l_in) as sum_{l = n l_min}^{n l_max} A_l 2^-l,
      with A_l at most 2^l (the maximum possible number of binary strings of length l), so each term is <= 1;
      since there are at most n l_max values of l, S^n <= n l_max
  . proof by drawing: draw the hierarchical table of all possible codewords and realize that one cannot pick
    many short codewords (see Huffman coding below): if one picks 0 and 1, then one cannot pick any other codeword
    (every codeword starts with 0 or 1); if one picks 00, 01 and 1, one cannot add any more codewords either.
    More generally, associate to any codeword the quantity 2^-length(codeword), which is its width in the
    hierarchical table. Then sum_i 2^-l_i is the total width of all codewords in that table, and the table has width 1.

- lower bound on the compression length
  . average length(encoding X) >= H(X)   (proof with Gibbs' inequality + Kraft)
    with equality only if the code lengths are coherent with the probabilities: l_i = - log p_i

- upper bound: it is possible to build a code whose average length stays within one bit of that lower bound
  ==> Source coding theorem (symbol codes):
      there exists a variable-length encoding C of an ensemble X such that the average length of an encoded symbol,
      L(C, X), satisfies L(C, X) in [H(X), H(X) + 1).
      Proof: Kraft with l_i = ceil(- log p_i)   [a numerical check follows this section]
  ==> constructive proof in next section:
      1. Huffman coding (best possible with an integer number of bits for each symbol)
      2. arithmetic coding (non-integer number of bits: reaches the entropy rate)
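
- quick numerical check of the above (Python, on an arbitrary example distribution): with l_i = ceil(-log2 p_i),
  the Kraft sum is <= 1 (so a prefix code with these lengths exists) and the average length lands in [H(X), H(X)+1):

    # Check of the Kraft inequality and of the source coding theorem bound,
    # using the lengths l_i = ceil(-log2 p_i) on an example distribution (assumption).
    import math

    p = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}

    lengths = {s: math.ceil(-math.log2(q)) for s, q in p.items()}
    kraft_sum = sum(2.0 ** -l for l in lengths.values())
    H = -sum(q * math.log2(q) for q in p.values())          # entropy H(X)
    L = sum(q * lengths[s] for s, q in p.items())           # average codeword length

    print("lengths:", lengths)
    print("Kraft sum:", kraft_sum, "(<= 1: a prefix code with these lengths exists)")
    print("H(X) =", H, "<= L =", L, "< H(X) + 1 =", H + 1)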

------------------------------------------------------------------------------------------

V - Huffman and arithmetic coding
---------------------------------

- Huffman coding   [see the sketch at the end of this section]
  . write down one node per symbol, with its probability
  . iteratively, join the two nodes with the lowest probabilities into a new node, whose probability is their sum
  . in the end, this forms a binary tree
  . code = position in the tree (example: 101 means: go down from the root, first right, then left, then right)
  . proof of optimality: draw the table of all possible codewords: any change increases the average code length
    ==> sort symbols by frequency, and fill the "budget" table: no choice!
    . for a given set of codeword lengths, the most frequent symbol is allocated the shortest codeword
    . optimal set of codeword lengths: suppose another prefix code is better, with different codelengths for the
      two least-probable symbols A and B (say length(code(A)) > length(code(B))). The longest codeword length
      necessarily appears twice in an optimal prefix code (and it does in Huffman's:
      length(Huffman(A)) = length(Huffman(B))). As a consequence, there exists another symbol C with
      length(code(C)) = length(code(A)) > length(code(B)). But p(C) > p(B). Hence this code is suboptimal:
      contradiction. So the two least-probable symbols have the same codelength. Apply the reasoning recursively.

- arithmetic coding   [see the sketch at the end of this section]
  . set an order on the symbols, and associate to symbol number s the interval [cp(s-1), cp(s)], where cp is the
    cumulative probability distribution (cp(3) = p(1) + p(2) + p(3)), with the convention cp(0) = 0
  . any real number in [0,1] falls into only one bin (interval); under the uniform law on [0,1],
    it falls into the bin of symbol i with probability p(i)
  . encode a symbol by writing down in binary a real number falling into its bin
  . choose the shortest such number (fewest bits), so that no ambiguity is possible (it falls into bin i and not i+1)
  . to encode a string s = (s1, s2,..., sn) of symbols, do the same with p(s)
    => to encode in a streaming way, use the factorization p(s) = p(s1) p(s2|s1) p(s3|s1,s2) ...,
       which narrows the interval iteratively
  . reaches the theoretical bound (entropy) ==> optimal
  . Huffman = approximation of arithmetic coding; exact when the probabilities are powers of 2 (2^-b)
  . to generate random data with the correct law: decode a random number in [0,1]
  . how do we know we have reached the end of the data?
    => cost of an end-of-file symbol: very low, because it happens only once in a long while; with probability epsilon:
       H(p) --> (1-epsilon) H(p) - (1-epsilon) log(1-epsilon) - epsilon log(epsilon)
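
- compact Python sketch of the Huffman construction above, on an assumed example distribution;
  each heap entry carries the partial codewords of the symbols it contains, so merging two nodes
  just prepends one bit:

    # Huffman coding: repeatedly merge the two lowest-probability nodes.
    import heapq

    def huffman_code(probs):
        """Return {symbol: codeword} for a dict {symbol: probability}."""
        # heap items: (probability, tie-breaker, {symbol: partial codeword})
        heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(sorted(probs.items()))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p0, _, codes0 = heapq.heappop(heap)   # two lowest-probability nodes
            p1, _, codes1 = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in codes0.items()}
            merged.update({s: '1' + c for s, c in codes1.items()})
            heapq.heappush(heap, (p0 + p1, counter, merged))
            counter += 1
        return heap[0][2]

    p = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}     # example distribution (assumption)
    code = huffman_code(p)
    print(code)
    print("average length:", sum(p[s] * len(c) for s, c in code.items()))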
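
- non-streaming sketch of arithmetic coding with exact rationals, for an assumed i.i.d. toy model:
  compute the interval of the whole string, then emit the shortest binary word whose dyadic interval
  fits inside it. A real coder would stream, renormalize, and use conditional probabilities; the output
  length still lands within a couple of bits of -log2 p(string):

    # Narrow the interval [low, low + width) symbol by symbol, then emit the shortest
    # binary word 0.b1...bk whose interval [val, val + 2^-k) lies inside it.
    import math
    from fractions import Fraction

    PROBS = {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}   # toy model (assumption)
    ORDER = ['a', 'b', 'c']                                                   # order defining the bins

    def encode(string):
        low, width = Fraction(0), Fraction(1)
        for s in string:                                     # each symbol narrows the interval by p(s)
            cum = sum(PROBS[t] for t in ORDER[:ORDER.index(s)])
            low, width = low + width * cum, width * PROBS[s]
        mid, val, bits, k = low + width / 2, Fraction(0), '', 0
        while not (low <= val and val + Fraction(1, 2 ** k) <= low + width):
            k += 1
            bit = int(mid >= val + Fraction(1, 2 ** k))      # k-th binary digit of the interval midpoint
            val += bit * Fraction(1, 2 ** k)
            bits += str(bit)
        return bits

    msg = 'aababca'
    code = encode(msg)
    print(code, len(code), "bits;  -log2 p(msg) =", -math.log2(float(math.prod(PROBS[s] for s in msg))))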

------------------------------------------------------------------------------------------

VI - Additional remarks
-----------------------

- laws over the integers
  . code 1: encoding length log n + 2 log log n
    => p(n) proportional to 2^(- log n - 2 log log n) = 1/(n (log n)^2)
    ==> this sum does converge, so this distribution can be normalized into a probability
  . code 2: encoding length log n + log log n + log log log n + ...
    => p(n) proportional to 1/(n log n log log n ...)
    ==> converges or not? limit case. Kraft says yes.
  . note: a code 3 corresponding to p(n) proportional to 1/n? or to 1/(n log n)?
    not possible, because the sum over n diverges
  . example: tax cheaters
    . cheaters don't know information theory and pick all digits in {0,1,...,9} i.i.d. (uniformly)
    . while, under the previous laws, the first digit is much more likely to be 1

- measure model efficiency/error in bits

- about generation: typicality: data (made of n symbols) is typical of a model if its average coding length
  per symbol is close to the entropy of the model


------------------------------------------------------------------------------------------

VII - Example: text compression/prediction/generation with Markov chains
------------------------------------------------------------------------

- examples: text generation: see the practical session:
  https://www.lri.fr/~gcharpia/machinelearningcourse/2017/5/TD5.pdf
  . example with the IID model
  . example with Markov chains
  . how to learn Markov chains without storing a full matrix

- compression: Hutter prize:
  https://en.wikipedia.org/wiki/Hutter_Prize
  http://prize.hutter1.net/
  . earn thousands of euros if you are able to compress that file by 1% more


------------------------------------------------------------------------------------------

Conclusion
----------

- good models for data = good for everything: prediction, generation, compression

- measure model suitability in bits, measure generative model errors in bits (seeing the models as compressors)
  ==> baseline: gzip (Lempel-Ziv)   [see the bits-per-character sketch below]

- (- log p) is reachable with a non-integer number of bits, thanks to arithmetic coding
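
- baseline measurement sketch (Python): turn zlib (the DEFLATE scheme used by gzip, i.e. Lempel-Ziv + Huffman)
  into a bits-per-character score for a text; 'some_text.txt' is a placeholder file name:

    import zlib

    with open('some_text.txt', 'rb') as f:
        data = f.read()

    n_bits = 8 * len(zlib.compress(data, 9))        # compressed size in bits
    print("zlib baseline: %.3f bits per character" % (n_bits / len(data)))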