------------------------------------------------------------------------------------------
 Lesson 2 : Compression/Prediction/Generation equivalence
------------------------------------------------------------------------------------------


I - Concepts: compression/prediction/generation
-----------------------------------------------

- lossless compression / prediction / generative models

  . data = sequence of symbols, from a known, predefined alphabet (discrete setting)

  . lossless compression:
    . encoder: data --> binary string
      injective! otherwise not decodable (known as: uniquely decodable code)
    . decoder: binary string --> data
      encoder not surjective ==> not optimal compression (easy to build a better compressor)
    . lossless <==> decoder o encoder = Id (no loss)

  . prediction
    . given a sequence of observed variables (x1,...,xn), predict x_{n+1}
    . i.e. give the probability of each possible label (discrete space)
    . so: predictor = function : data (x1...xn) --> p(x_{n+1}|x1,...,xn)   probability of the next symbol

  . generative model
    . model able to generate new data
    . data D is generated according to (= as often as) its model probability p(D)


------------------------------------------------------------------------------------------

II - From one concept to the other
----------------------------------

- prediction --> generation   [easy; see the sampling sketch at the end of this section]
  . start from x1 (given, or taken from a learned distribution)
  . choose x_{n+1} according to the probability law p(x_{n+1}|x1...xn) used to predict the next symbol
  . iterate and generate data this way
  . p(D) = prod_i p(xi|x1...x_{i-1})

- generation --> prediction   [by sampling, might be slow]
  . given x1...xn, how to estimate the probability distribution of x_{n+1}?
  . generate new data D many, many times, according to the generative model's law
  . check how frequently each (x1...xn, x_{n+1}) is generated: this is p(x1...x_{n+1})
  . compute p(x_{n+1}|x1...xn) = p(x1...x_{n+1}) / p(x1...xn)   (Bayes)

- compression --> prediction   [see the zlib sketch at the end of this section]
  . idea: strings whose encoding is shorter are more probable (coherent with L = - log p, i.e. p = 2^-L)
    ==> to complete a string: L(string+'c') --> p('c') proportional to 2^-L(string+'c')
  . given x1...xn, build all possible completed sequences (x1,...,xn,x_{n+1}) : s_k = (x1,...,xn, k-th symbol)
  . compress each of them independently, and check their encoding lengths L_k = size( encode(s_k) )
  . define p(k) = 2^-L_k
  . normalize p(k) by a global factor if necessary (we will see that the sum is already 1 if the encoder is optimal)
  . define p(x_{n+1}=k|x1...xn) = p(k)

- prediction --> compression   [the main point of this lesson]
  . lesson from entropy: if we have good prior information on the distribution of x_{n+1}|x1...xn,
    only little information needs to be communicated to know which symbol was chosen for x_{n+1}
  . encode just that information --> in such a way that it is decodable later
  . entropy = average length of that code; so, intuitively, it is not possible to do shorter (without losing information)
  . can we prove that bound, and is it possible to reach that limit? how? ==> that's the next sections
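
- sampling sketch (Python) for "prediction --> generation": the predictor below is a made-up toy
  first-order model standing in for whatever model was learned; the sampling loop and the chain-rule
  product p(D) = prod_i p(xi|x1...x_{i-1}) are the point:

    # Sample x_{n+1} from the predictor's distribution p(x_{n+1} | x1..xn) and iterate.
    # The toy predictor is a hypothetical stand-in for any learned model.
    import random

    def predict(prefix):
        """Toy predictor: returns p(next symbol | prefix) as a dict (made-up probabilities)."""
        if prefix and prefix[-1] == 'a':
            return {'a': 0.1, 'b': 0.6, 'c': 0.3}
        return {'a': 0.5, 'b': 0.25, 'c': 0.25}

    def generate(n, seed=None):
        """Generate n symbols by iterating the predictor; also return p(D) = prod_i p(x_i | x_<i)."""
        rng = random.Random(seed)
        data, p_D = [], 1.0
        for _ in range(n):
            probs = predict(data)
            symbols, weights = zip(*probs.items())
            x = rng.choices(symbols, weights=weights)[0]   # sample x_i ~ p(. | x_<i)
            p_D *= probs[x]
            data.append(x)
        return data, p_D

    print(generate(10, seed=0))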
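
- minimal sketch (Python) of "compression --> prediction", using zlib as an off-the-shelf compressor:
  p(next symbol = c) is taken proportional to 2^-L(prefix + c). Compressed lengths come in whole bytes,
  so the estimate is very coarse (often uniform on short prefixes); this only illustrates the principle,
  not a practical predictor:

    # Complete the prefix with each candidate symbol, compress, and set
    # p(c) proportional to 2^(-L(prefix + c)); zlib lengths are in bytes (8 bits each).
    import zlib

    def next_symbol_distribution(prefix, alphabet):
        lengths = {c: len(zlib.compress((prefix + c).encode('utf-8'), 9)) for c in alphabet}
        weights = {c: 2.0 ** (-8 * l) for c, l in lengths.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}   # normalize: zlib is far from optimal

    print(next_symbol_distribution("ab" * 200, "abc"))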

------------------------------------------------------------------------------------------

III - Encoders
--------------

- issue: when to stop reading the data

  . self-delimiting codes:
    . to code an integer n: needs b = log n bits; but one also needs to say that there are b bits to be read
      (if b is not known in advance)
    . ==> need to encode first the number b = log n; this takes log b = log log n bits
    . ==> additional log log n bits at the beginning of the encoded binary string
    . issue: one also needs to tell to read log log n bits...
    . ==> recursively, needs log n + log log n + log log log n + ... + log^k n bits,
      with k such that log^k n has only 1 digit (i.e. is 0 or 1)
    . or just log n + 2 log log n to make things simpler (log log n is already negligible w.r.t. log n),
      by starting with 000000...001 (log log n bits) to tell that there are as many bits (log log n) to read next
      to find log n, which will tell how many bits to read to find n.

  . codes with an "end-of-file" character
    . postfix: read the data until you meet a special code meaning "end"
    . to do this: make binary blocks of fixed size, and use one of them (e.g. 0000) for "end"
    . obvious additional cost: - log p(end) = +4 here
    . hidden cost: the block coding for 'end' cannot be used anywhere else
      ==> cost = number of blocks * - log(1-p(end))
          (factor ~ 0.093 here ==> not negligible when encoding long strings)

  . prefix codes: the code of a string is never the beginning of the code of another string
    (otherwise, confusion when decoding [though it may be resolved later in the reading of the encoded string])
    . examples (starting from page 104 in MacKay's book)
    . we will see later that, without loss of generality, one can assume a code is prefix
      (otherwise, rebuild a code with the same coding lengths)


------------------------------------------------------------------------------------------

IV - Bounds, optimal codes and Kraft inequality
-----------------------------------------------

- entropy ==> make use of symbol probabilities
  . to reduce the average coding length: frequent symbols should have a shorter code
  . hint from the compression/prediction equivalence: 2^-Length(code of s) should be proportional to p(s)
    ==> we expect Length(code of s) = - log p(s)   (which we already knew)

- Kraft inequality
  . given an encoder, if (l_i) is the list of the lengths of all possible codewords (with multiplicity), then:
      the code is uniquely decodable ==> sum_i 2^-l_i <= 1
    and reciprocally, if (l_i) satisfies sum_i 2^-l_i <= 1, then there exists an encoder (prefix code)
    whose codewords have this length distribution
  . if equality, the code is said to be "complete"
  . math proof: write S = sum_i 2^-l_i, consider any integer n, expand S^n, count terms,
    show that it grows at most linearly with n, hence S cannot be > 1 (otherwise S^n would grow exponentially)
    - more details: denoting l_min = min_i l_i and l_max = max_i l_i, regroup the terms in
      S^n = sum_{i1,i2,...,in} 2^-(l_i1 + l_i2 + ... + l_in) as sum_{l = n l_min}^{n l_max} A_l 2^-l,
      with A_l at most 2^l (the maximum possible number of binary strings of length l), so each term is <= 1;
      since there are at most n l_max values of l, S^n <= n l_max
  . proof by drawing: draw the hierarchical table of all possible codewords and realize that one cannot pick
    many short codewords (see Huffman coding below): if one picks 0 and 1, then one cannot pick any other codeword
    (every codeword starts with 0 or 1); if one picks 00, 01 and 1, one cannot add any more codewords either.
    More generally, associate to any codeword the quantity 2^-length(codeword), which is its width in the
    hierarchical table. Then sum_i 2^-l_i is the total width of all codewords in that table, and the table has width 1.

- lower bound on the compression length
  . average length(encoding X) >= H(X)   (proof with Gibbs' inequality + Kraft)
    with equality only if the code lengths are coherent with the probabilities: l_i = - log p_i

- upper bound: it is possible to build a code whose average length stays within one bit of that lower bound
  ==> Source coding theorem (symbol codes):
      there exists a variable-length encoding C of an ensemble X such that the average length of an encoded symbol,
      L(C, X), satisfies L(C, X) in [H(X), H(X) + 1).
      Proof: Kraft with l_i = ceil(- log p_i)   [a numerical check follows this section]
  ==> constructive proof in next section:
      1. Huffman coding (best possible with an integer number of bits for each symbol)
      2. arithmetic coding (non-integer number of bits: reaches the entropy rate)
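
- quick numerical check of the above (Python, on an arbitrary example distribution): with l_i = ceil(-log2 p_i),
  the Kraft sum is <= 1 (so a prefix code with these lengths exists) and the average length lands in [H(X), H(X)+1):

    # Check of the Kraft inequality and of the source coding theorem bound,
    # using the lengths l_i = ceil(-log2 p_i) on an example distribution (assumption).
    import math

    p = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}

    lengths = {s: math.ceil(-math.log2(q)) for s, q in p.items()}
    kraft_sum = sum(2.0 ** -l for l in lengths.values())
    H = -sum(q * math.log2(q) for q in p.values())          # entropy H(X)
    L = sum(q * lengths[s] for s, q in p.items())           # average codeword length

    print("lengths:", lengths)
    print("Kraft sum:", kraft_sum, "(<= 1: a prefix code with these lengths exists)")
    print("H(X) =", H, "<= L =", L, "< H(X) + 1 =", H + 1)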

------------------------------------------------------------------------------------------

V - Huffman and arithmetic coding
---------------------------------

- Huffman coding   [see the sketch at the end of this section]
  . write down one node per symbol, with its probability
  . iteratively, join the two nodes with the lowest probabilities into a new node, whose probability is their sum
  . in the end, this forms a binary tree
  . code = position in the tree (example: 101 means: go down from the root, first right, then left, then right)
  . proof of optimality: draw the table of all possible codewords: any change increases the average code length
    ==> sort symbols by frequency, and fill the "budget" table: no choice!
    . for a given set of codeword lengths, the most frequent symbol is allocated the shortest codeword
    . optimal set of codeword lengths: suppose another prefix code is better, with different codelengths for the
      two least-probable symbols A and B (say length(code(A)) > length(code(B))). The longest codeword length
      necessarily appears twice in an optimal prefix code (and it does in Huffman's:
      length(Huffman(A)) = length(Huffman(B))). As a consequence, there exists another symbol C with
      length(code(C)) = length(code(A)) > length(code(B)). But p(C) > p(B). Hence this code is suboptimal:
      contradiction. So the two least-probable symbols have the same codelength. Apply the reasoning recursively.

- arithmetic coding   [see the sketch at the end of this section]
  . set an order on the symbols, and associate to symbol number s the interval [cp(s-1), cp(s)], where cp is the
    cumulative probability distribution (cp(3) = p(1) + p(2) + p(3)), with the convention cp(0) = 0
  . any real number in [0,1] falls into only one bin (interval); under the uniform law on [0,1],
    it falls into the bin of symbol i with probability p(i)
  . encode a symbol by writing down in binary a real number falling into its bin
  . choose the shortest such number (fewest bits), so that no ambiguity is possible (it falls into bin i and not i+1)
  . to encode a string s = (s1, s2,..., sn) of symbols, do the same with p(s)
    => to encode in a streaming way, use the factorization p(s) = p(s1) p(s2|s1) p(s3|s1,s2) ...,
       which narrows the interval iteratively
  . reaches the theoretical bound (entropy) ==> optimal
  . Huffman = approximation of arithmetic coding; exact when the probabilities are powers of 2 (2^-b)
  . to generate random data with the correct law: decode a random number in [0,1]
  . how do we know we have reached the end of the data?
    => cost of an end-of-file symbol: very low, because it happens only once in a long while; with probability epsilon:
       H(p) --> (1-epsilon) H(p) - (1-epsilon) log(1-epsilon) - epsilon log(epsilon)
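
- compact Python sketch of the Huffman construction above, on an assumed example distribution;
  each heap entry carries the partial codewords of the symbols it contains, so merging two nodes
  just prepends one bit:

    # Huffman coding: repeatedly merge the two lowest-probability nodes.
    import heapq

    def huffman_code(probs):
        """Return {symbol: codeword} for a dict {symbol: probability}."""
        # heap items: (probability, tie-breaker, {symbol: partial codeword})
        heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(sorted(probs.items()))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p0, _, codes0 = heapq.heappop(heap)   # two lowest-probability nodes
            p1, _, codes1 = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in codes0.items()}
            merged.update({s: '1' + c for s, c in codes1.items()})
            heapq.heappush(heap, (p0 + p1, counter, merged))
            counter += 1
        return heap[0][2]

    p = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}     # example distribution (assumption)
    code = huffman_code(p)
    print(code)
    print("average length:", sum(p[s] * len(c) for s, c in code.items()))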
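
- non-streaming sketch of arithmetic coding with exact rationals, for an assumed i.i.d. toy model:
  compute the interval of the whole string, then emit the shortest binary word whose dyadic interval
  fits inside it. A real coder would stream, renormalize, and use conditional probabilities; the output
  length still lands within a couple of bits of -log2 p(string):

    # Narrow the interval [low, low + width) symbol by symbol, then emit the shortest
    # binary word 0.b1...bk whose interval [val, val + 2^-k) lies inside it.
    import math
    from fractions import Fraction

    PROBS = {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}   # toy model (assumption)
    ORDER = ['a', 'b', 'c']                                                   # order defining the bins

    def encode(string):
        low, width = Fraction(0), Fraction(1)
        for s in string:                                     # each symbol narrows the interval by p(s)
            cum = sum(PROBS[t] for t in ORDER[:ORDER.index(s)])
            low, width = low + width * cum, width * PROBS[s]
        mid, val, bits, k = low + width / 2, Fraction(0), '', 0
        while not (low <= val and val + Fraction(1, 2 ** k) <= low + width):
            k += 1
            bit = int(mid >= val + Fraction(1, 2 ** k))      # k-th binary digit of the interval midpoint
            val += bit * Fraction(1, 2 ** k)
            bits += str(bit)
        return bits

    msg = 'aababca'
    code = encode(msg)
    print(code, len(code), "bits;  -log2 p(msg) =", -math.log2(float(math.prod(PROBS[s] for s in msg))))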

------------------------------------------------------------------------------------------

VI - Additional remarks
-----------------------

- laws over the integers
  . code 1: encoding length log n + 2 log log n
    => p(n) proportional to 2^(- log n - 2 log log n) = 1/(n (log n)^2)
    ==> this sum does converge, so this distribution can be normalized into a probability
  . code 2: encoding length log n + log log n + log log log n + ...
    => p(n) proportional to 1/(n log n log log n ...)
    ==> converges or not? limit case. Kraft says yes.
  . note: a code 3 corresponding to p(n) proportional to 1/n? or to 1/(n log n)?
    not possible, because the sum over n diverges
  . example: tax cheaters
    . cheaters don't know information theory and pick all digits in {0,1,...,9} i.i.d. (uniformly)
    . while, under the previous laws, the first digit is much more likely to be 1

- measure model efficiency/error in bits

- about generation: typicality: data (made of n symbols) is typical of a model if its average coding length
  per symbol is close to the entropy of the model


------------------------------------------------------------------------------------------

VII - Example: text compression/prediction/generation with Markov chains
------------------------------------------------------------------------

- examples: text generation: see the practical session:
  https://www.lri.fr/~gcharpia/machinelearningcourse/2017/5/TD5.pdf
  . example with the IID model
  . example with Markov chains
  . how to learn Markov chains without storing a full matrix

- compression: Hutter prize:
  https://en.wikipedia.org/wiki/Hutter_Prize
  http://prize.hutter1.net/
  . earn thousands of euros if you are able to compress that file by 1% more


------------------------------------------------------------------------------------------

Conclusion
----------

- good models for data = good for everything: prediction, generation, compression

- measure model suitability in bits, measure generative model errors in bits (seeing the models as compressors)
  ==> baseline: gzip (Lempel-Ziv)   [see the bits-per-character sketch below]

- (- log p) is reachable with a non-integer number of bits, thanks to arithmetic coding
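
- baseline measurement sketch (Python): turn zlib (the DEFLATE scheme used by gzip, i.e. Lempel-Ziv + Huffman)
  into a bits-per-character score for a text; 'some_text.txt' is a placeholder file name:

    import zlib

    with open('some_text.txt', 'rb') as f:
        data = f.read()

    n_bits = 8 * len(zlib.compress(data, 9))        # compressed size in bits
    print("zlib baseline: %.3f bits per character" % (n_bits / len(data)))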