
To generate a synthetic dataset using a trained VAE, I am unsure which of two approaches to use (both are sketched in code below):

  1. Use the learned latent space: z = mu + eps * exp(0.5 * log_var) (the reparameterization trick) to generate (theoretically, infinite amounts of) data. Here, we are learning the mu and log_var vectors using the data, and eps is sampled from a multivariate standard Gaussian distribution.

  2. Sample z directly from a multivariate standard Gaussian distribution.
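In code, the two options look roughly like this (a PyTorch-style sketch; `encoder`, `decoder`, and `latent_dim` are placeholder names, not necessarily what my repositories use):

```python
import torch

# Approach 1: reparameterize around the posterior of real images x.
def sample_via_posterior(encoder, decoder, x):
    mu, log_var = encoder(x)                   # per-image posterior parameters
    eps = torch.randn_like(mu)                 # eps ~ N(0, I)
    z = mu + eps * torch.exp(0.5 * log_var)    # z = mu + eps * sigma
    return decoder(z)

# Approach 2: draw latents directly from the standard Gaussian prior.
def sample_via_prior(decoder, n, latent_dim):
    z = torch.randn(n, latent_dim)             # z ~ N(0, I)
    return decoder(z)
```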

I am leaning more towards point 1, since we learn the mu and log_var vectors from our dataset, whereas point 2 uses the uninformative prior, which contains no particular information about the dataset.

One of the purposes of a VAE is to learn this "unknown" latent-space distribution by constraining it to approximate a multivariate standard Gaussian distribution, while at the same time allowing it sufficient flexibility to deviate from it.

What are your thoughts? I have implemented some VAE and Conditional VAE code in both TensorFlow 2 and PyTorch, which you can refer to here.

Arun

2 Answers


A few more clarifications. While the correct thing to do is to draw from the prior, we have no guarantee that the aggregated posterior will cover the prior. Think of the aggregated posterior as the distribution of the latent variables for your dataset (see here for a nice explanation and visualization). Our hope is that it will look like the prior, but in practice we often get a mismatch between the prior and the aggregated posterior. In this case, sampling from the prior might fail because part of it is not covered by the aggregated posterior. This can be addressed in various ways, such as learning the prior or computing the aggregated posterior after training.
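As a rough sketch of that last option, assuming placeholder `encoder` and `train_loader` objects and a PyTorch-style model, one could encode the whole training set, fit a simple density model to the latents, and sample from that instead of $N(0,I)$:

```python
import torch
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def fit_aggregated_posterior(encoder, train_loader, n_components=10):
    zs = []
    for x, _ in train_loader:
        mu, log_var = encoder(x)
        eps = torch.randn_like(mu)
        zs.append(mu + eps * torch.exp(0.5 * log_var))   # posterior samples for this batch
    zs = torch.cat(zs).cpu().numpy()
    return GaussianMixture(n_components=n_components).fit(zs)

# For generation, sample from the fitted mixture instead of the prior:
# z_new, _ = gmm.sample(64)
# images = decoder(torch.from_numpy(z_new).float())
```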


Maybe there's a misconception: we are not learning a single mu and log_var, but a mapping (the encoder) from an image to mu and log_var. This is quite different, because mu and log_var are not two fixed vectors for the whole dataset; they are computed separately for each image.

In a similar fashion, the decoder is a learned mapping from the latent space, on which the prior $N(0,I)$ is placed, back to the image space.

Essentially, the encoder takes an image as input and spits out the parameters of another Gaussian (the posterior). This means that during training the input to the decoder is conditioned on the image. Let's take MNIST as an example. We hope that, after training, the encoder has learned to spit out similar mu and log_var for similar digits, and that the decoder has learned to decode noise from a posterior into a specific digit.

For example, with a 1-dimensional latent, what we hope for is something like this:

Input digit 0 --> Encoder gives mu 0.1 log_var 0.3
Input digit 0 --> Encoder gives mu 0.2 log_var 0.2
Input digit 1 --> Encoder gives mu 1.4 log_var 0.2
Input digit 1 --> Encoder gives mu 1.5 log_var 0.1
...
Input digit 9 --> Encoder gives mu -4.5 log_var 0.3

This blog post has a nice visualization with 2D latents.
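If you want to verify this behaviour on your own model, a quick inspection could look like the following (hypothetical `encoder` and `test_loader` placeholders; images of the same digit should land near each other in latent space):

```python
import torch

@torch.no_grad()
def inspect_posteriors(encoder, test_loader, n_images=8):
    x, y = next(iter(test_loader))
    mu, log_var = encoder(x[:n_images])
    for label, m, lv in zip(y[:n_images], mu, log_var):
        print(f"digit {label.item()}: mu={m.tolist()}, log_var={lv.tolist()}")
```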

If we didn't have the encoder, we would always draw noise from the same $N(0,I)$ Gaussian. This could also work, but then we'd need a different training technique, as in GANs.

At test time, we often want to draw a sample from the whole data distribution, and for that reason we should use the prior $N(0,I)$. If for some reason you want to condition the output to look like a specific sample, then you can use the posterior. For example, if you only want digits of 1, you can pass an image of a 1 through the encoder and then use the resulting mu and log_var to draw samples.
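A minimal sketch of that conditional case (again with placeholder `encoder`/`decoder` names) might be:

```python
import torch

@torch.no_grad()
def variations_of(encoder, decoder, x_single, n=16):
    # x_single: one image with a batch dimension, e.g. shape (1, 1, 28, 28)
    mu, log_var = encoder(x_single)                  # posterior of this particular image
    std = torch.exp(0.5 * log_var)
    z = mu + torch.randn(n, mu.shape[-1]) * std      # n samples around this image
    return decoder(z)                                # n variations that resemble it
```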

So the question is: do you want a sample from the whole distribution? Then use the prior.

sfotiadis

I think method 1 will provide the best output.

Approximating the empirical distribution of $z$ should provide decoder inputs in the subset of latent space that the decoder was trained on.

Sampling from $N(0,I)$ could undersample or omit some regions of the true distribution, oversample others, and even provide inputs to the decoder that it isn't trained for (and neural networks aren't usually good at extrapolation).
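One way to gauge how severe this is for a particular model is to compare the empirical latent distribution with $N(0,I)$. A rough check, assuming placeholder `encoder` and `train_loader` objects in PyTorch, might be:

```python
import torch

@torch.no_grad()
def aggregated_latent_stats(encoder, train_loader):
    zs = []
    for x, _ in train_loader:
        mu, log_var = encoder(x)
        zs.append(mu + torch.randn_like(mu) * torch.exp(0.5 * log_var))
    zs = torch.cat(zs)
    print("aggregated posterior mean:", zs.mean(dim=0))  # near 0 if it matches the prior
    print("aggregated posterior std: ", zs.std(dim=0))   # near 1 if it matches the prior
```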

Lee Reeves