{
"cells": [
{
"cell_type": "markdown",
"id": "a9155b0b",
"metadata": {},
"source": [
"# TP3 - Compression, Prediction, Generation: Text Entropy\n",
"\n",
"#### Stella Douka, Guillaume Charpiat \n",
"\n",
"#### Credits: Vincenzo Schimmenti"
]
},
{
"cell_type": "markdown",
"id": "5a7becdf",
"metadata": {
"tags": []
},
"source": [
"### Introduction\n",
"\n",
"In this TP we are interested in compressing and generating texts written in natural languages.\n",
"Given a text of length $n$, a sequence of symbols is simply a vector $(x_1, \\dots, x_n)$ where each $x_i$ is a symbol, i.e. $x_i = a, b, c, \\dots$. If we define the alphabet of possible symbols as $\\mathcal{A} = \\{a_1,a_2,\\dots,a_M\\}$, then each $x_i$ can take $M$ values.\n",
"\n",
"In order to model the sequence of symbols we need a joint probability distribution over the whole sequence, namely $p(X_1 = x_1, X_2 = x_2, \\dots , X_n = x_n)$. With an alphabet of $M$ symbols, modelling a sequence of length $n$ would require $M^n$ probabilities, so some assumptions are needed to reduce this dimensionality. Here we will use two different models for $p$: the IID model and the Markov Chain model."
]
},
{
"cell_type": "markdown",
"id": "5f27e3d5",
"metadata": {
"id": "AXFy0UMH_jDH"
},
"source": [
"### IID Model\n",
"The IID model assumes:\n",
"\n",
"$$ p(X_1 = x_1, X_2 = x_2, \\dots , X_n = x_n) = \\prod_{i=1}^n p(X_i = x_i)$$\n",
"\n",
"i.e. that the symbols in a sequence are independent and identically distributed. With this model we need only $M$ probabilities, one for each symbol. One can generalize and use symbols made not of a single character but of multiple ones. For example, using 3 characters per symbol, the symbols would be of the form $aaa, aab, \\dots, zzz$. When using $k$ characters per symbol in an alphabet of $M$ characters, $M^k$ probabilities are needed.\n"
]
},
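{
"cell_type": "markdown",
"id": "3f9a1b2c",
"metadata": {},
"source": [
"For illustration, a minimal sketch (with arbitrary names, not part of the required interface) of fitting IID symbol probabilities, splitting the text into non-overlapping symbols of $k$ characters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f9a1b2d",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def iid_probabilities(text, k=1):\n",
"    # split the text into consecutive, non-overlapping symbols of k characters\n",
"    symbols = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]\n",
"    counts = Counter(symbols)\n",
"    total = sum(counts.values())\n",
"    return {s: c / total for s, c in counts.items()}\n",
"\n",
"p = iid_probabilities('abracadabra', k=1)\n",
"print(p['a'])  # 5 occurrences out of 11 characters -> 5/11\n",
"print(len(iid_probabilities('abracadabra', k=2)))  # distinct 2-character symbols"
]
},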
{
"cell_type": "markdown",
"id": "434ff068",
"metadata": {
"id": "ZsX7DGxl5pF7"
},
"source": [
"### Markov Chain Model\n",
"\n",
"The Markov Chain model assumes a limited range of dependence between the symbols. Indeed, for a Markov Chain of order $k$:\n"
"\n",
"\n",
"$$p(X_i | X_{i-1},X_{i-2},\\dots,X_1) = p(X_i | X_{i-1},X_{i-2},\\dots,X_{i-k})$$\n",
"\n",
"\n",
"This structure means that the $i$-th symbol in the sequence depends only on the previous $k$ symbols. We add the *time invariance* assumption, meaning that the conditional probabilities do not depend on the time index $i$, i.e. $p(X_i | X_{i-1},X_{i-2},\\dots,X_{i-k}) = p(X_{k+1} | X_{k},X_{k-1},\\dots,X_{1})$. The most common and widely used Markov Chain is the Markov Chain of order 1:\n"
"\n",
"$$p(X_i | X_{i-1},X_{i-2},\\dots,X_1) = p(X_i | X_{i-1})$$\n",
"\n",
"In this case the conditional probability $p(X_i|X_{i−1})$ can be expressed using $M^2$\n",
"numbers. Usually this is referred to as the *transition matrix*. Given an alphabet $\\mathcal{A} = \\{a_1,a_2,\\dots,a_M\\}$ the transition matrix can be written as: \n",
"\n",
"$$ \\mathbb{M}_{kl} = p(X_i = a_k| X_{i-1} = a_l) $$"
]
},
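{
"cell_type": "markdown",
"id": "7c4d2e1a",
"metadata": {},
"source": [
"For illustration, a minimal sketch (a hypothetical helper, not part of the required interface) of estimating the transition matrix $\\mathbb{M}_{kl}$ from a text, with columns indexed by the previous symbol as in the convention above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c4d2e1b",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def transition_matrix(text, alphabet):\n",
"    idx = {c: i for i, c in enumerate(alphabet)}\n",
"    M = np.zeros((len(alphabet), len(alphabet)))\n",
"    for prev, cur in zip(text, text[1:]):\n",
"        M[idx[cur], idx[prev]] += 1  # row k = next symbol, column l = previous symbol\n",
"    M /= np.maximum(M.sum(axis=0), 1)  # normalise each column to sum to 1\n",
"    return M\n",
"\n",
"print(transition_matrix('abab', 'ab'))  # 'a' is always followed by 'b' and vice versa"
]
},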
{
"cell_type": "markdown",
"id": "8de07c15",
"metadata": {},
"source": [
"### Entropy and Cross-Entropy\n",
"\n",
"\n",
"- For the IID model of order 1 the entropy computation is straightforward: \n",
"$$ H_{IID} = -\\sum_{i=1}^M p(a_i) \\log p(a_i)$$ \n",
"and consequently, starting from two distributions $p,q$ fitted on two different texts, the cross-entropy:\n",
"$$ CE_{IID} = -\\sum_{i=1}^M p(a_i) \\log q(a_i)$$\n",
"\n",
"\n",
"- For the MC model of order 1 the entropy is defined as follows: \n",
"$$ H_{MC} = - \\sum_{kl} \\pi(a_l)\\, p(X_i = a_k| X_{i-1} = a_l) \\log \\left(p(X_i = a_k| X_{i-1} = a_l)\\right) = - \\sum_{kl} \\pi_l \\mathbb{M}_{kl} \\log \\mathbb{M}_{kl}$$\n",
"where $\\pi$ is the stationary distribution of the Markov Chain, i.e. $\\pi_k = \\sum_l \\mathbb{M}_{kl} \\pi_l$. The code to compute the stationary distribution is already given.\n",
"The cross-entropy:\n",
"$$ CE_{MC} = - \\sum_{kl} \\pi_l \\mathbb{M}_{kl} \\log \\mathbb{M'}_{kl}$$\n",
"where $\\mathbb{M}$ and $\\mathbb{M'}$ are fitted on two different texts.\n"
]
},
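{
"cell_type": "markdown",
"id": "9e8f7a6b",
"metadata": {},
"source": [
"As a quick numerical illustration of the IID formulas on toy distributions (base-2 logarithms are chosen here so that entropies are in bits):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e8f7a6c",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"p = np.array([0.5, 0.5])\n",
"q = np.array([0.9, 0.1])\n",
"H = -np.sum(p * np.log2(p))   # entropy of p: 1 bit\n",
"CE = -np.sum(p * np.log2(q))  # cross-entropy of p relative to q\n",
"print(H, CE)  # CE >= H, with equality iff q == p"
]
},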
{
"cell_type": "markdown",
"id": "c5e6a81f",
"metadata": {},
"source": [
"### Theoretical Questions: \n",
"\n",
"1) Interpret the time invariance assumption associated with our Markov chains in the context of text generation.\n",
"\n",
"2) How can we rewrite a Markov chain of higher order as a Markov chain of order 1?\n",
"\n",
"3) Given a probability distribution over symbols, how can we use it to generate sentences?"
]
},
{
"cell_type": "markdown",
"id": "a4eeceff",
"metadata": {},
"source": [
"### Practical questions\n",
"\n",
"In order to construct our IID and Markov Chain models we need some text. Our source will be a set of classical novels available at: https://www.lri.fr/~gcharpia/informationtheory/TP2_texts.zip\n",
"\n",
"We will use the symbols in each text to learn the probabilities of each model. The alphabet we suggest is `string.printable`, which consists of $\\sim 100$ characters (see below).\n",
"\n",
"For both models, perform the following steps:\n",
"\n",
"1) For different orders of dependencies, train the model on a novel and compute the associated entropy. What do you observe as the order increases? Explain your observations.\n",
"\n",
"2) Use the other novels as test sets and compute the cross-entropy for each model trained previously. How to handle symbols (or sequences of symbols) not seen in the training set?\n",
"\n",
"3) For each order of dependencies, compare the cross-entropy with the entropy. Explain and interpret the differences.\n",
"\n",
"4) Choose the order of dependencies with the lowest cross-entropy and generate some sentences.\n",
"\n",
"5) Train one model per novel and use the KL divergence in order to cluster the novels.\n",
"\n",
"\n",
"**Hints** : \n",
"\n",
"- In the MC case, limit yourself to order $2$ (the computation can become quite expensive): if you have $M \\sim 100$ characters, order $1$ requires a $\\sim 100 \\times 100$ matrix and order $2$ a $\\sim 10^4 \\times 10^4$ matrix.\n",
"\n",
"- For the second order MC model you need to compute $p(X_{i+1},X_{i}|X_{i},X_{i-1})$, i.e. the transition probability between consecutive pairs of symbols.\n",
"\n",
"- It is possible to implement the two models efficiently with dictionaries in Python. For the IID model, a key of the dictionary is simply a symbol and the value is the number of occurrences of the symbol in the text. For a Markov chain, a key of the dictionary is also a symbol, but the value is a vector that contains the number of occurrences of each character of the alphabet. Notice that a symbol may consist of one or several characters. Note also that there is no need to explicitly consider all possible symbols; the ones observed in the training set are sufficient.\n",
"\n",
"- A low probability can be assigned to symbols not observed in the training set."
]
},
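{
"cell_type": "markdown",
"id": "5b6c7d8e",
"metadata": {},
"source": [
"One possible reading of the last hint is additive (Laplace) smoothing; here is a minimal sketch with an arbitrary helper name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b6c7d8f",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def smoothed_probability(counts, symbol, alphabet_size, alpha=1.0):\n",
"    # additive (Laplace) smoothing: unseen symbols get a small but nonzero probability\n",
"    total = sum(counts.values()) + alpha * alphabet_size\n",
"    return (counts.get(symbol, 0) + alpha) / total\n",
"\n",
"counts = Counter('aab')\n",
"print(smoothed_probability(counts, 'z', alphabet_size=100))  # unseen symbol: 1/103\n",
"print(smoothed_probability(counts, 'a', alphabet_size=100))  # seen symbol: 3/103"
]
},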
{
"cell_type": "markdown",
"id": "aed41454",
"metadata": {},
"source": [
"#### Computing stationary distribution \n",
"\n",
"Here we provide two versions of the function that computes the stationary distribution of a Markov chain, together with a small example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "363b0251",
"metadata": {},
"outputs": [],
"source": [
"# direct way to find pi (can be slow)\n",
"import numpy as np\n",
"\n",
"def Compute_stationary_distribution(P_kl):\n",
"    ## P_kl must be the transition matrix from state l to state k!\n",
"    evals, evecs = np.linalg.eig(P_kl)\n",
"    evec1 = evecs[:, np.isclose(evals, 1)]\n",
"    evec1 = evec1[:, 0]\n",
"    pi = evec1 / evec1.sum()\n",
"    pi = pi.real  # stationary probability\n",
"    return pi\n",
"\n",
"# iterative way (should be faster)\n",
"def Compute_stationary_distribution_it(P_kl, n_it):\n",
"    pi = np.random.uniform(size=P_kl.shape[0])  # initial state, can be a random one!\n",
"    pi /= pi.sum()\n",
"    for t in range(n_it):\n",
"        pi = np.matmul(P_kl, pi)\n",
"    return pi"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54d57e60",
"metadata": {},
"outputs": [],
"source": [
"## simple example of computing the stationary distribution\n",
"\n",
"n_it = 1000  ## remember to check that n_it is enough to reach convergence\n",
"P_kl = np.array([[0.7,0.5],[0.3,0.5]])\n",
"Compute_stationary_distribution_it(P_kl,n_it)"
]
},
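{
"cell_type": "markdown",
"id": "2a3b4c5d",
"metadata": {},
"source": [
"As a sanity check (assuming the two cells above have been run), the eigenvector and iterative methods should agree on this example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a3b4c5e",
"metadata": {},
"outputs": [],
"source": [
"pi_eig = Compute_stationary_distribution(P_kl)\n",
"pi_it = Compute_stationary_distribution_it(P_kl, n_it)\n",
"print(pi_eig, pi_it)  # both should be close to [0.625, 0.375]\n",
"print(np.allclose(pi_eig, pi_it))"
]
},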
{
"cell_type": "markdown",
"id": "9d775cd1",
"metadata": {},
"source": [
"#### Defining the Alphabet\n",
"\n",
"Example of loading a text and filtering out characters that are not in the chosen alphabet"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad8b6a8a",
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"def import_text(file_name):\n",
"    lines = []\n",
"    with open(file_name, encoding='UTF8') as f:\n",
"        lines = f.readlines()\n",
"    text = '\\n'.join(lines)\n",
"    printable = set(string.printable)\n",
"    text = ''.join(filter(lambda x: x in printable, text))\n",
"    return text\n",
"\n",
"text = import_text('./texts/Alighieri.txt')"
]
},
{
"cell_type": "markdown",
"id": "01f69db4",
"metadata": {},
"source": [
"#### IID - MODEL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1551fa0c",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import math\n",
"from collections import Counter \n",
"\n",
"class IIDModel:\n",
"    \"\"\"An interface for the text model\"\"\"\n",
"    def __init__(self, order=1):\n",
"        print(\"Creation of the model\")\n",
"        self.order = order\n",
"\n",
"    def process(self, text):\n",
"        ## ...\n",
"        pass\n",
"\n",
"    def getEntropy(self):\n",
"        ## ...\n",
"        pass\n",
"\n",
"    def getCrossEntropy(self, text):\n",
"        ## ...\n",
"        pass\n",
"\n",
"    def generate(self, length):\n",
"        ## ...\n",
"        pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1430d0f6",
"metadata": {},
"outputs": [],
"source": [
"## clustering texts\n",
"\n",
"def KL_divergence(dist1, dist2):\n",
"    ## ...\n",
"    pass\n"
]
},
{
"cell_type": "markdown",
"id": "8cab4004",
"metadata": {},
"source": [
"#### MARKOV CHAIN - MODEL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0d7e515",
"metadata": {},
"outputs": [],
"source": [
"class MarkovModel:\n",
"    \"\"\"An interface for the text model\"\"\"\n",
"    def __init__(self, order=2):\n",
"        print(\"Creation of the model\")\n",
"        self.order = order\n",
"\n",
"    def process(self, text):\n",
"        ## ...\n",
"        pass\n",
"\n",
"    def getEntropy(self):\n",
"        ## ...\n",
"        pass\n",
"\n",
"    def getCrossEntropy(self, text):\n",
"        ## ...\n",
"        pass\n",
"\n",
"    def generate(self, length):\n",
"        ## ...\n",
"        pass"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}