TP1-M1ProML-Correction

Posted on Tue 30 April 2019 in posts

TP1 : inférence Bayésienne et K-moyennes¶

In [6]:

import matplotlib.pyplot as plt
import numpy as np
import scipy.special

Inférence Bayésienne¶

On souhaite tracer la courbe

$$ p(u|n_N,N) = \frac{p(n_N|u,N) p(u)}{p(n_N|N)} $$

vu en cours. On considérera donc les 11 boîtes, chacune contenant 10 boules. Pour cela on codera les fonctions

implémentation de la fonction $p(n_N|u,N)$
implémentation de la fonction $p(n_N|N)$
implémentation de la fonction $p(u|n_N,N)$

Rappel : $p(n_N|u,N) = \frac{N!}{n_N! (N-n_N)!} (u/10)^{n_N} (1-u/10)^{N-n_n}$

In [55]:

Nu = 11 # nb de boites
NB = 10 # nb de boules dans une boite

In [56]:

def proba_Nn_u(nN,u,N):
    C = scipy.special.binom(N, nN)
    pr = (u/NB)**nN*(1-u/NB)**(N-nN)
    return C*pr

# test
for i in range(20):
    print("Proba nN=",i," = ",proba_Nn_u(i,2,20))

Proba nN= 0  =  0.011529215046068483
Proba nN= 1  =  0.05764607523034242
Proba nN= 2  =  0.13690942867206327
Proba nN= 3  =  0.20536414300809488
Proba nN= 4  =  0.21819940194610074
Proba nN= 5  =  0.17455952155688062
Proba nN= 6  =  0.10909970097305038
Proba nN= 7  =  0.054549850486525185
Proba nN= 8  =  0.022160876760150862
Proba nN= 9  =  0.007386958920050286
Proba nN= 10  =  0.0020314137030138287
Proba nN= 11  =  0.00046168493250314287
Proba nN= 12  =  8.65659248443393e-05
Proba nN= 13  =  1.3317834591436813e-05
Proba nN= 14  =  1.6647293239296018e-06
Proba nN= 15  =  1.6647293239296019e-07
Proba nN= 16  =  1.3005697843200012e-08
Proba nN= 17  =  7.65041049600001e-10
Proba nN= 18  =  3.1876710400000044e-11
Proba nN= 19  =  8.38860800000001e-13

In [60]:

# exemple
NTir = 100
U = 3
pr = np.zeros(NTir+1)
for i in range(NTir+1):
    pr[i] = proba_Nn_u(i,U,NTir)
    
xr = np.arange(0,1.00001,1/NTir)
plt.plot(xr,pr)

Out[60]:

[<matplotlib.lines.Line2D at 0x7fa066267b38>]

In [61]:

def proba_Nn(nN,N):
    sum = 0
    for u in range(Nu):
        sum += proba_Nn_u(nN,u,N)
    return sum/Nu

In [62]:

def CalcPu(Nn,N):
    Pu = np.zeros(Nu)
    for u in range(Nu):
        Pu[u] = proba_Nn_u(Nn,u,N)/proba_Nn(Nn,N)/Nu
    return Pu

In [66]:

# Calcul de P(u|...)

#Nb de boules noires
NNoires = 3
#Nb de tirages
NTirages = 10

Pu = CalcPu(NNoires,NTirages)
Pu10 = CalcPu(NNoires*5,NTirages*5)
Pu100 = CalcPu(NNoires*20,NTirages*20)
    
plt.plot(Pu)
plt.plot(Pu10)
plt.plot(Pu100)

Out[66]:

[<matplotlib.lines.Line2D at 0x7fa0660ddb70>]

Le modèle Bayésien naïf¶

On va passer ici à un modèle bayésien pour classer nos données. Le modèle Bayésien naïf fonctionne de la façon suivante : on écrit la probabilité d'être dans la classe $\C_k$ étant donnée un chiffre $\vec{x}$ à l'aide du théorème de Bayes :

$$ p(C_k|\vec{x}) = \frac{p(\vec{x}|C_k) p(C_k)}{p(\vec{x})} $$

Il faut maintenant définir la vraisemblance et le prior. Selon vous, quelle serait une bonne valeur pour la distribution à priori $p(C_k)$ ?

Le modèle Gaussien¶

On va utiliser pour la vraisemblance une modélisation Gaussienne :

$$ p(\vec{x}|C_k) = \prod_{i=1}^D \left( \frac{1}{\sqrt{2\pi}} e^{-(x_i - m_i^{(k)})^2/2} \right) $$

Votre travail :

Comment trouver les paramètres $\vec{m}^{(k)}$ de la classe $k$ ? (et faites le)
Tester les performances de ce modèle sur les données. Utiliser le train_set pour calibrer le paramètres et le test set pour calculer les performances.

In [10]:

# On récupère les données pour deux chiffres
d1 = 3
d2 = 7
X1 = train_set[0][np.where(train_set[1]==d1)]
X2 = train_set[0][np.where(train_set[1]==d2)]
X1_test = test_set[0][np.where(test_set[1]==d1)]
X2_test = test_set[0][np.where(test_set[1]==d2)]

In [84]:

X1.shape[0]/(X1.shape[0]+X2.shape[0])

Out[84]:

0.49639937718956795

In [85]:

m1 = np.mean(X1,axis=0)
m2 = np.mean(X2,axis=0)
f,ax = plt.subplots(1,2)
ax[0].imshow(m1.reshape(28,28))
ax[1].imshow(m2.reshape(28,28))

Out[85]:

<matplotlib.image.AxesImage at 0x7fe463dde6d8>

In [72]:

# on va considérer que p(C_k) = 1/2
d11 = np.linalg.norm(X1_test - m1,axis=1)
d12 = np.linalg.norm(X1_test - m2,axis=1)
d21 = np.linalg.norm(X2_test - m1,axis=1)
d22 = np.linalg.norm(X2_test - m2,axis=1)

print("Data1, Err=",np.mean(d11>d12)," Correct=",np.mean(d11<d12))
print("Data2, Err=",np.mean(d21<d22)," Correct=",np.mean(d21>d22))

Data1, Err= 0.039603960396039604  Correct= 0.9603960396039604
Data2, Err= 0.02529182879377432  Correct= 0.9747081712062257

On va utiliser pour la vraisemblance une modélisation Bernoulli :

$$ p(\vec{x}|C_k) = \prod_{i=1}^D p_{ki}^{x_i} (1-p_{ki})^{1-x_i} $$

où $p_{ki}$ représente la probabilité que le pixel $i$ est allumé pour la classe $k$.

Commencer par binariser les données (les mettres à 0 ou à 1) par rapport à un seuil (par exemple $0.3$).
Comment trouver les paramètres $p_{ki}$ ? (et faites le!)
Dans cette approche, les valeurs $p_{ki}=1$ ou $p_{ki}=0$ peuvent poser problème (comportement trop abrupte). Vous ferez attention à régulariser ces valeurs en introduisant une valeur minimale/maximale différente (par exemple $p_{ki} = 10^{-3}$ ou $p_{ki} = 1-10^{-3}$).
Tester les performances de ce modèle sur les mêmes données que précédement

In [82]:

ϵ = 1e-3
p1 = np.mean((X1>0.3),axis=0)
p1[np.where(p1<ϵ)] = ϵ
p1[np.where(p1>(1-ϵ))] = 1-ϵ

p2 = np.mean((X2>0.3),axis=0)
p2[np.where(p2<ϵ)] = ϵ
p2[np.where(p2>(1-ϵ))] = 1-ϵ

f,ax = plt.subplots(1,2)
ax[0].imshow(p1.reshape(28,28))
ax[1].imshow(p2.reshape(28,28))

Out[82]:

<matplotlib.image.AxesImage at 0x7fe463ebee80>

In [67]:

X1_test_bin = 1*(X1_test > 0.3)
X2_test_bin = 1*(X2_test > 0.3)

$$ \log(p(\vec{x}|C_k)) = \sum_i^D x_i \log(p_{ik}) + (1-x_i) \log(1-p_{ik}) $$

In [78]:

d11 = np.sum(np.multiply(X1_test_bin,np.log(p1)),axis=1) + np.sum(np.multiply(1-X1_test_bin,np.log(1-p1)),axis=1)
d12 = np.sum(np.multiply(X1_test_bin,np.log(p2)),axis=1) + np.sum(np.multiply(1-X1_test_bin,np.log(1-p2)),axis=1)

d21 = np.sum(np.multiply(X2_test_bin,np.log(p1)),axis=1) + np.sum(np.multiply(1-X2_test_bin,np.log(1-p1)),axis=1)
d22 = np.sum(np.multiply(X2_test_bin,np.log(p2)),axis=1) + np.sum(np.multiply(1-X2_test_bin,np.log(1-p2)),axis=1)

print("Data1, Err=",np.mean(d11<d12)," Correct=",np.mean(d11>d12))
print("Data2, Err=",np.mean(d22<d21)," Correct=",np.mean(d22>d21))

Data1, Err= 0.033663366336633666  Correct= 0.9663366336633663
Data2, Err= 0.027237354085603113  Correct= 0.9727626459143969

In [ ]: