{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generative modelling in deep learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generative modelling in machine learning can aim at achieving different goals.\n",
"\n",
"The first, obvious one is that a generative model can be used to generate more data, to be used afterwards by an other algorithm. While a generative model cannot create more information to solve the issue of having too small datasets, it could be used to solve anonymity questions. Typically, sharing a generative model trained on private data could allow the exploitation of the statistical property of this data without sharing the data itself (which can be protected by privacy matters for example).\n",
"\n",
"Another goal is to use generative modelling to better understand the data at hand. This is based on the hypothesis that a model that successfully learned to generate (and generalize) a dataset should have internally learned some efficient and compressed representation of the information contained in the data. In this case, analysing a posteriori the learned representation may give us insights on the data itself.\n",
"\n",
"The notion of a generative model however needs to be more formally specified, in order to work with. What does it mean for the model to generate data that \"looks like\" the original dataset? A mathematical formulation of that is necessary, in order to define a training objective that can be used efficiently. Having some expert rate the quality of all generated datapoints one by one is definitely not an option.\n",
"\n",
"Thus, modelling our data and models as probability distributions comes to the rescue. If we consider our data as coming from some underlying probability distribution, that we will name $p_D$, our goal is thus to train our model to represent another probability distribution, which we will name $p_\\theta$, that should be some good approximation of $p_D$. Given we only know $p_D$ through some set of realisations from it (the dataset), we can never hope to learn it exactly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1: Can you name some ways to compare two given distributions $p_D$ and $p_\\theta$?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most comparison methods can be separated into two kinds: those that compare the density of the distributions ($p_\\theta(x)$ vs $p_D(x)$), and those that compare the values sampled from them directly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2: Given we want to use them as an optimisation objective, what are the caveats to keep in mind about these two kinds?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this work, we will focus on two of the most widely used generative models based on deep neural networks: Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), in order to compare them and understand their strenghts and weaknesses."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generative Adversarial Networks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GANs structure is based on modelling the distribution $p_\\theta$ as a learned deterministic function applied to a standard noise. Sampling from it is thus done as follows: first, some noise is sampled from a standard N-dimentional gaussian distribution: $\\epsilon \\sim \\mathcal{N}(0;I)$, and then the output is computed as a deterministic function $x = f_\\theta(\\epsilon)$. The function $f_\\theta$ is implemented as a neural network, $\\theta$ representing its learned parameters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3: What is, a priori, the impact of the choice of N, the dimension of the input noise?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By construction, this generator structure only allows sampling of the distribution $p_\\theta$, and does not allow the computation of the density $p_\\theta(x)$ (at least not without strong assumptions on $f_\\theta$. Such a model seems to need a comparison method based on samples to be trained.\n",
"\n",
"The smart idea of GANs is to instead use another neural network to model the objective. Another neural network is introduced: a classifier (that we call the discriminator) which is trained to differentiate examples from the dataset from examples generated by $p_\\theta$. The reasoning is as follows:\n",
"\n",
"The discriminator (whose output is denoted by $D(x)$) is trained using a classic discrimination loss, so that $D(x)$ can be interpreted as the probability that $x$ came from the real dataset:\n",
"\n",
"$$ \\mathcal{L}_D = \\mathbb{E}_{p_D} \\left[ -\\log D(x) \\right] + \\mathbb{E}_{p_\\theta} \\left[ -\\log \\left(1-D(x)\\right) \\right] $$\n",
"\n",
"From that, it can be shown that the optimal discriminator is given by $D(x) = \\frac{p_D(x)}{p_\\theta(x) + p_D(x)}$, and when reached its loss takes a specific value:\n",
"\n",
"$$ \\mathcal{L}_D = 2 \\left( \\log 2 - JSD(p_\\theta \\| p_D) \\right) $$\n",
"\n",
"So, training the generator network to *maximize* the same loss would, assuming the discriminator is always trained to optimality, minimize the JSD between $p_\\theta$ and $p_D$, and thus bring $p_\\theta$ closer to $p_D$."
]
},
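{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the missing step, using the standard pointwise argument (the symbols $D^*$ and $m$ are introduced here for convenience): for a fixed $x$, the integrand of $\\mathcal{L}_D$ is $-p_D(x) \\log D(x) - p_\\theta(x) \\log(1 - D(x))$. Setting its derivative with respect to $D(x)$ to zero gives\n",
"\n",
"$$ -\\frac{p_D(x)}{D(x)} + \\frac{p_\\theta(x)}{1 - D(x)} = 0 \\quad \\Longrightarrow \\quad D^*(x) = \\frac{p_D(x)}{p_\\theta(x) + p_D(x)} $$\n",
"\n",
"Writing $m = \\frac{1}{2}(p_D + p_\\theta)$, so that $D^* = \\frac{p_D}{2m}$ and $1 - D^* = \\frac{p_\\theta}{2m}$, the loss at the optimum becomes\n",
"\n",
"$$ \\mathcal{L}_D = -\\mathbb{E}_{p_D} \\log \\frac{p_D}{2m} - \\mathbb{E}_{p_\\theta} \\log \\frac{p_\\theta}{2m} = 2 \\log 2 - D_{KL}(p_D \\| m) - D_{KL}(p_\\theta \\| m) = 2 \\left( \\log 2 - JSD(p_\\theta \\| p_D) \\right) $$\n",
"\n",
"where the last equality uses the definition $JSD(p_\\theta \\| p_D) = \\frac{1}{2} D_{KL}(p_\\theta \\| m) + \\frac{1}{2} D_{KL}(p_D \\| m)$."
]
},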
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q4: Can you anticipate a caveat of using the JSD as a training objective for the generator?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having the generator trained to maximize $\\mathcal{L}_D$ is equivalent to setting its training loss to $ \\mathcal{L}_G = \\mathbb{E}_{p_\\theta} \\log(1-D(x)) $."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q5: This loss only gives feedback to the generator on samples it generated, what can this imply?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now work on implementing a GAN on a simple toy problem, to get a feeling of its behavior and test our theoretical insights. For this we will use the `pytorch` library.\n",
"\n",
"While a real problem would be generating images for example (each datapoint $x$ would then be a different image), this is a kind of task that easily requires intensive CPU/GPU power, and image datasets are difficult to visualize from a geometric point of view (even small images contains hundreds of pixels, and nobody can visualize points in a 100-dimensional space). So instead we will focus on points in the plane: each datapoint $x$ will actually be a couple of numbers $(x1, x2)$, and our target dataset will be 25 Gaussian distributions with small variance, distributed on a $5\\times 5$ grid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# Our dataset is mathematically defined, we can generate batches on the fly and enjoy\n",
"# an infinite-size dataset\n",
"def generate_batch(batchlen):\n",
" \"\"\"This function generates a batch of length 'batchlen' from the 25-gaussian dataset.\n",
" \n",
" return a torch tensor of dimensions (batchlen, 2)\n",
" \"\"\"\n",
" # to sample from the gaussian mixture, we first sample the means for each point, then\n",
" # add a gaussian noise with small variance\n",
" samples = torch.multinomial(torch.tensor([0.2,0.2,0.2,0.2,0.2]), 2*batchlen, replacement=True)\n",
" means = (2.0 * (samples - 2.0)).view(batchlen,2).type(torch.FloatTensor)\n",
" return torch.normal(means, 0.05)"
]
},
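{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the sampling code above, the target distribution can be written explicitly. This is only a restatement of `generate_batch` (means on the grid $\\{-4, -2, 0, 2, 4\\}^2$, standard deviation $0.05$), with the indices $i, j$ introduced here for convenience:\n",
"\n",
"$$ p_D(x) = \\frac{1}{25} \\sum_{i=-2}^{2} \\sum_{j=-2}^{2} \\mathcal{N}\\left(x; (2i, 2j), 0.05^2 I\\right) $$\n",
"\n",
"The spacing between neighbouring means ($2$) is 40 times the standard deviation of each component, so the 25 modes are very well separated."
]
},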
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot a batch, to see what the dataset looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"batch = generate_batch(256)\n",
"\n",
"plt.scatter(batch[:,0], batch[:,1], s=2.0, label='Batch of data from our gaussian mixture dataset')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now need to define our two neural networks, the generator and the discriminator. The generator will take as input a value $z$ sampled from a Gaussian prior, and output a value $x$ (thus a couple $(x_1,x_2)$). The discriminator takes as input a value $x$, and is a binary classifier.\n",
"\n",
"When representing a binary classifier with a neural network, it is better for the last layer to consist of only a sigmoid activation, so that the output values will be between 0 and 1 and stand for the probability (according to the classifier) that the input is of the first class. The output is thus of the form $\\mathrm{sigmoid}(h)$. The loss involves quantities such as $-\\log(\\mathrm{sigmoid}(h))$. For numerical stability reasons, it is recommended to rewrite the loss in order to make use of the $\\mathrm{softplus}$ function defined by $\\mathrm{softplus}(h) = \\log(1 + \\exp(h))$ and provided in PyTorch as `torch.softplus`.\n",
"As the $\\mathrm{softplus}(h)$ formulation of the loss already contains the sigmoid activation, in practice the last layer of the discrimator network will not have any activation function (being just $h$)."
]
},
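{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the rewriting concrete, here are the two identities it relies on, which follow directly from $\\mathrm{sigmoid}(h) = \\frac{1}{1 + \\exp(-h)}$ and the definition of $\\mathrm{softplus}$ given above:\n",
"\n",
"$$ -\\log\\left(\\mathrm{sigmoid}(h)\\right) = \\log\\left(1 + \\exp(-h)\\right) = \\mathrm{softplus}(-h) $$\n",
"\n",
"$$ -\\log\\left(1 - \\mathrm{sigmoid}(h)\\right) = \\log\\left(1 + \\exp(h)\\right) = \\mathrm{softplus}(h) $$\n",
"\n",
"Evaluating the left-hand sides naively can produce $\\log(0)$ when $|h|$ is large, which is the kind of problem the $\\mathrm{softplus}$ form is meant to avoid."
]
},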
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"# Choose a value for the prior dimension\n",
"PRIOR_N = 1\n",
"\n",
"# Define the generator\n",
"class Generator(nn.Module):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.fc1 = nn.Linear(PRIOR_N, 2)\n",
" self.fc2 = nn.Linear(2, 2)\n",
" \n",
" def __call__(self, z):\n",
" h = F.relu(self.fc1(z))\n",
" return self.fc2(h)\n",
" \n",
" def generate(self, batchlen):\n",
" z = torch.normal(torch.zeros(batchlen, PRIOR_N), 1.0)\n",
" return self.__call__(z)\n",
" \n",
"\n",
"# Define the discriminator\n",
"class Discriminator(nn.Module):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.fc1 = nn.Linear(2, 2)\n",
" self.fc2 = nn.Linear(2, 1)\n",
" \n",
" def __call__(self, x):\n",
" h = F.relu(self.fc1(x))\n",
" return self.fc2(h)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With these classes in shape, only the training loop is still missing. To stick with the mathematical GAN framework, we should train the discriminator until convergence between each training step of the generator. This is not practical for two reasons: first it takes a lot of time, and second if the discriminator is too good, its gradient will vanish (as seen in **Q4**) and thus no information will be passed to the generator.\n",
"\n",
"We will then train the discriminator a fixed number of times between each training iteration of the generator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of times to train the discriminator between two generator steps\n",
"TRAIN_RATIO = 1\n",
"# Total number of training iterations for the generator\n",
"N_ITER = 20001\n",
"# Batch size to use\n",
"BATCHLEN = 128\n",
"\n",
"generator = Generator()\n",
"optim_gen = torch.optim.Adam(generator.parameters(), lr=0.0001, betas=(0.5,0.9))\n",
"discriminator = Discriminator()\n",
"optim_disc = torch.optim.Adam(discriminator.parameters(), lr=0.0001, betas=(0.5,0.9))\n",
"\n",
"for i in range(N_ITER):\n",
" # train the discriminator\n",
" for _ in range(TRAIN_RATIO):\n",
" discriminator.zero_grad()\n",
" real_batch = generate_batch(BATCHLEN)\n",
" fake_batch = generator.generate(BATCHLEN)\n",
" # Compute here the discriminator loss, using functions like torch.sum, torch.exp, torch.log,\n",
" # torch.softplus, using real_batch and fake_batch\n",
" disc_loss = 0 # FILL HERE\n",
" disc_loss.backward()\n",
" optim_disc.step()\n",
" # train the generator\n",
" generator.zero_grad()\n",
" fake_batch = generator.generate(BATCHLEN)\n",
" # Compute here the generator loss, using fake_batch\n",
" gen_loss = 0 # FILL HERE\n",
" gen_loss.backward()\n",
" optim_gen.step()\n",
" if i%100 == 0:\n",
" print('step {}: discriminator: {:.3e}, generator: {:.3e}'.format(i, float(disc_loss), float(gen_loss)))\n",
" # plot the result\n",
" real_batch = generate_batch(1024)\n",
" fake_batch = generator.generate(1024).detach()\n",
" plt.scatter(real_batch[:,0], real_batch[:,1], s=2.0, label='real data')\n",
" plt.scatter(fake_batch[:,0], fake_batch[:,1], s=2.0, label='fake data')\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Complete the previous code and train your model.\n",
"\n",
"Depending on your choice of parameters, the training may not go well at all, with the generator completely collapsing quickly at the beginning of the training. It has been observed by the litterature that the generator's loss $\\mathcal{L}_G = \\mathbb{E}_{p_\\theta} \\log(1-D(x))$ is often to blame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q6: Why could we anticipate that this loss could cause this?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This issue is solved by replacing the generator loss by an alternative loss: $\\mathcal{L}_G = \\mathbb{E}_{p_\\theta} [ -\\log D(x) ]$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q7: Inspect the impact of these different factors:**\n",
"\n",
"- depth / width of the generator network\n",
"- depth / width of the discriminator network\n",
"- impact of `TRAIN_RATIO`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For further readings on GANs, you can see the following papers:\n",
"\n",
"- Generative Adversarial Networks *(Goodfellow et al.)*: [arXiv:1406.2661](https://arxiv.org/abs/1406.2661)\n",
"- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks *(Radford et al.)*: [arXiv:1511.06434](https://arxiv.org/abs/1511.06434)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variational AutoEncoders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An other well-known approach to generative modelling is embodied by Variational AutoEncoders (VAEs). While the generative model itself and the procedure to sample it is similar to GANs, the way it is trained is not.\n",
"\n",
"The main goal of VAEs is to optimize the likelihood of the real data according to the generative model. In other words, maximize $\\mathbb{E}_{p_D} \\log p_\\theta(x)$, which is equivalent to minimizing $D_{KL}(p_D \\| p_\\theta)$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q8: Prove this equivalence.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the class of distributions for which $\\log p_\\theta(x)$ can be analytically computed and optimized is very restricted, and not suitable for real world problems. The main idea of the VAE is thus to introduce a latent variable $z$ and decompose the distribution like so: $p_\\theta(x, z) = p_\\theta(x | z) p(z)$. Where here $p(z)$ is some fixed prior and $p_\\theta(x | z)$ is a simple distribution whose parameters are the output of a neural network.\n",
"\n",
"For example, you could have $p(z)$ be a standard $\\mathcal{N}(0;1)$ and $p_\\theta(x | z)$ be defined as a gaussian $\\mathcal{N}(\\mu_\\theta(z); \\sigma_\\theta(z))$ where $\\mu_\\theta(z)$ and $\\sigma_\\theta(z)$ are created by the neural network you will train. In this case, the resulting distribution $p_\\theta(x) = \\int_z p_\\theta(x|z)p(z)ds$ is an infinite mixture of gaussians, which is a much more expressive class of distributions.\n",
"\n",
"Now, this cannot stop here, as we are not able to analitically compute the density $p_\\theta(x)$. The second main idea of the VAE is to introduce an other, auxilliary distribution: $q_\\phi(z | x)$, which will be modelled by a neural network similarly to $p_\\theta(x | z)$. Introducing it allows us to create a lower bound for $\\log p_\\theta(x)$:\n",
"\n",
"$$\\log p_\\theta(x) = \\mathbb{E}_{z \\sim q_\\phi} \\log p_\\theta(x) = \\mathbb{E}_{z \\sim q_\\phi} \\left[ \\log p_\\theta(x) \\frac {q_\\phi(z|x)}{q_\\phi(z|x)} \\right]$$\n",
"\n",
"Following Bayes theorem, $p_\\theta(x) p_\\theta(z|x) = p_\\theta(x, z) = p_\\theta(x|z) p(z)$, so we get:\n",
"\n",
"$$\\log p_\\theta(x) = \\mathbb{E}_{z \\sim q_\\phi} \\left[ \\log \\frac{p_\\theta(x|z) p(z)}{p_\\theta(z|x)} \\frac {q_\\phi(z|x)}{q_\\phi(z|x)} \\right]$$\n",
"\n",
"Re-organizing the terms:\n",
"\n",
"$$\\log p_\\theta(x) = \\mathbb{E}_{z \\sim q_\\phi} \\log \\frac{q_\\phi(z|x)}{p_\\theta(z|x)} - \\mathbb{E}_{z \\sim q_\\phi} \\log \\frac{p(z)}{q_\\phi(z|x)} + \\mathbb{E}_{z \\sim q_\\phi} \\log p_\\theta(x | z)$$\n",
"\n",
"This can be re-expressed like so:\n",
"\n",
"$$\\log p_\\theta(x) = D_{KL}(q_\\phi(z | x) \\| p_\\theta(z | x)) - D_{KL}(q_\\phi(z | x) \\| p(z)) + \\mathbb{E}_{z \\sim q_\\phi} \\log p_\\theta(x|z)$$\n",
"\n",
"The 3 terms of this equality can be interpreted like so:\n",
"\n",
"- the first term measures how much $q_\\phi(z | x)$ is similar to $p_\\theta(z | x)$, or in other words is a good inverse of $p_\\theta(x | z)$\n",
"- the second term measures how similar $q_\\phi(z|x)$ is from the latent prior $p(z)$\n",
"- the third term is linked to how likely $p_\\theta$ is to yield the given $x$ when $z$ is sampled from $q_\\phi(z | x)$ rather than $p(z)$\n",
"\n",
"It is interesting to note that the first term, being a KL-divergence is always positive. As such the combination of the last two terms form a lower bound of $\\log p_\\theta(x)$ which *can* be computed and used as a training objective. This bound is called the *Evidence Lower-Bound (ELBO)*. Simply flipping its sign can make it into a loss that can be minimized by gradient descent:\n",
"\n",
"$$ \\mathcal{L}_{ELBO} = D_{KL}(q_\\phi(z | x) \\| p(z)) + \\mathbb{E}_{z \\sim q_\\phi} [ - \\log p_\\theta(x|z) ]$$\n",
"\n",
"From this formulation comes the parallel with auto-encoders that give the VAE its name: $q_\\phi(z | x)$ can be seen as a *probabilistic encoder* from the data $x$ to the latent space $z$, and $p_\\theta(x | z)$ can be seen as a *probabilistic decoder* from the latent space $z$ to the data $x$. In this case the second term of $\\mathcal{L}_{ELBO}$ is the loss measuring the reconstruction quality of the auto-encoder, and the first term can be seens as a regularization of the latent space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q9: We can see that $p(z)$ is never sampled during the training process, how can that be a problem?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A typical choice to represent $q_\\phi(z | x)$ is to use a diagonal Gaussian distribution $\\mathcal{N}(\\mu_\\phi(x); Diag(\\sigma_\\phi(x)))$, which makes the KL-divergence term of $\\mathcal{L}_{ELBO}$ analytically computable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q10: Assuming $p(z)$ is a $\\mathcal{N}(0; Id)$ gaussian, what is the value of $D_{KL}(q_\\phi(z | x) \\| p(z))$?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also model $p_\\theta(x | z)$ as a diagonal Gaussian $\\mathcal{N}(\\mu_\\theta(z); Diag(\\sigma_\\theta(z)))$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q11: What is the expression of $-\\log p_\\theta(x | z)$ for given $x$ and $z$?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will build and train a VAE using the same dataset as previously, in order to compare its behavior to GANs. For numerical stability, we will interpret the output of the encoder and decoder networks as $(\\mu, \\log\\sigma^2)$, rather than $(\\mu, \\sigma)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Choose a value for the latent dimension\n",
"LATENT_N = 10\n",
"\n",
"# Define the generator\n",
"class Encoder(nn.Module):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.fc1 = nn.Linear(2, 2)\n",
" self.fc_mu = nn.Linear(2, LATENT_N)\n",
" self.fc_logvar = nn.Linear(2, LATENT_N)\n",
" \n",
" # encode a datapoint. This should return a couple of tensors (mu, logvar) representing\n",
" # the parameters of the gaussian q_\\phi(z | x)\n",
" def __call__(self, x):\n",
" h = F.relu(self.fc1(x))\n",
" mu = self.fc_mu(h)\n",
" logvar = self.fc_logvar(h)\n",
" return (mu, logvar)\n",
" \n",
"\n",
"# Define the discriminator\n",
"class Decoder(nn.Module):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.fc1 = nn.Linear(LATENT_N, 2)\n",
" self.fc_mu = nn.Linear(2, 2)\n",
" self.fc_logvar = nn.Linear(2, 2)\n",
" \n",
" # decode a datapoint. This should return a couple of tensors (mu, logvar) representing\n",
" # the parameters of the gaussian p_\\theta(z | x)\n",
" def __call__(self, z):\n",
" h = F.elu(self.fc1(z))\n",
" mu = self.fc_mu(h)\n",
" logvar = self.fc_logvar(h)\n",
" return (mu, logvar)\n",
"\n",
" def generate(self, batchlen):\n",
" z = torch.normal(torch.zeros(batchlen, LATENT_N), 1.0)\n",
" (mu, logvar) = self.__call__(z)\n",
" return torch.normal(mu, torch.exp(0.5*logvar))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From this, the parameters of both networks are trained conjointly using the same loss $\\mathcal{L}_{ELBO}$. Pytorch allows us to sample the Gaussian distribution in a differentiable way using `torch.normal(mu, sigma)`, but it is not differentiable wrt to its inputs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q12: How can you sample a distribution $\\mathcal{N}(\\mu, \\sigma)$ is a way that is differentiable w.r.t. both $\\mu$ and $\\sigma$?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Total number of training iterations for the VAE\n",
"N_ITER = 40001\n",
"# Batch size to use\n",
"BATCHLEN = 128\n",
"\n",
"encoder = Encoder()\n",
"optim_enc = torch.optim.Adam(encoder.parameters(), lr=0.001, betas=(0.5,0.9))\n",
"decoder = Decoder()\n",
"optim_dec = torch.optim.Adam(decoder.parameters(), lr=0.001, betas=(0.5,0.9))\n",
"\n",
"for i in range(N_ITER):\n",
" encoder.zero_grad()\n",
" decoder.zero_grad()\n",
" \n",
" x = generate_batch(BATCHLEN)\n",
" \n",
" enc_mu, enc_logvar = encoder(x)\n",
" # Compute here the DKL part of the VAE loss\n",
" loss_kl = 0 # FILL HERE\n",
" # Compute here the sample z, using Q12\n",
" z = 0 # FILL HERE\n",
" \n",
" dec_mu, dec_logvar = decoder(z)\n",
" # Compute here the second part of the VAE loss\n",
" loss_rec = 0 # FILL HERE\n",
" \n",
" (loss_kl + loss_rec).backward()\n",
" optim_enc.step()\n",
" optim_dec.step()\n",
" if i%100 == 0:\n",
" print('step {}: KL: {:.3e}, rec: {:.3e}'.format(i, float(loss_kl), float(loss_rec)))\n",
" # plot the result\n",
" real_batch = generate_batch(1024)\n",
" rec_batch = torch.normal(dec_mu, torch.exp(0.5*dec_logvar)).detach()\n",
" fake_batch = decoder.generate(1024).detach()\n",
" plt.scatter(real_batch[:,0], real_batch[:,1], s=2.0, label='real data')\n",
" plt.scatter(rec_batch[:,0], rec_batch[:,1], s=2.0, label='rec data')\n",
" plt.scatter(fake_batch[:,0], fake_batch[:,1], s=2.0, label='fake data')\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q13: Try hardcoding $\\sigma_\\theta(z)$ to some small value (like 0.01) rather than allowing the decoder to learn it. What does it change?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q14: How do the power of encoder and decoder affect the overall training of the VAE?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q15: As a conclusion, how would you compare the advantages and shortcomings of GANs and VAEs?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> (write your answer here)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}