{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 3\n", "\n", "The goal of this lab is first to study joint distributions of variables, and then to work through examples of Bayes' theorem and conditional probabilities." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " Question\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " Solution\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " Remarks\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " Subject\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joint distribution\n", "\n", "We will look at a simple example of a joint distribution. Consider the dataset `count_1w.txt`, which lists the ~300,000 most frequent (English) words in a collection of books. The file is structured as follows: each line contains a word followed by its number of occurrences. The words are sorted by decreasing number of occurrences.\n", "\n", " 1. Plot the empirical distribution of the words. How does the empirical distribution behave for the least frequent words?\n", " 2. Now consider each word separately. We want the probability that a given letter appears at least once in a word. Compute this empirical distribution.\n", " 3. Now compute the joint distribution of two letters following each other in a word (independently of the word's frequency). Display the distribution using the `imshow` function."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# seaborn is optional: it only changes the plot style\n", "try:\n", "    import seaborn as sns\n", "    sns.set()\n", "    sns.set_style(\"whitegrid\")\n", "    sns.set_context(\"poster\")\n", "except ImportError:\n", "    print('seaborn is not installed; falling back to the default matplotlib style')\n", "\n", "mpl.rcParams['figure.figsize'] = [8.0, 6.0]\n", "mpl.rcParams['figure.dpi'] = 80\n", "mpl.rcParams['savefig.dpi'] = 100\n", "\n", "mpl.rcParams['font.size'] = 10\n", "mpl.rcParams['axes.labelsize'] = 10\n", "mpl.rcParams['axes.titlesize'] = 17\n", "mpl.rcParams['ytick.labelsize'] = 10\n", "mpl.rcParams['xtick.labelsize'] = 10\n", "mpl.rcParams['legend.fontsize'] = 'large'\n", "mpl.rcParams['figure.titlesize'] = 'medium'\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "mpl.rc('axes', labelsize=15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
" | 0 | 1 |\n",
"---|---|---|\n",
"0 | the | 23135851162 |\n",
"1 | of | 13151942776 |\n",
"2 | and | 12997637966 |\n",
"3 | to | 12136980858 |\n",
"4 | a | 9081174698 |\n",
"