Can I start with perfect discriminator in GAN?

Question

In many implementations/tutorials of GANs that I've seen so far (e.g. this), the generator and discriminator start with no prior knowledge. They continuously improve their performance with training. This makes me wonder — is it possible to use a pre-trained discriminator? I have two motivations for doing so:

Eliminating the overhead of training the discriminator
Being able to use already existing cool models

Would the generator be able to learn just the same, or is it dependent on the fact that they start from scratch?

Some pretty interesting comments on the subject here: https://www.reddit.com/r/MachineLearning/comments/dwzb81/r_training_gans_with_a_pretrained_discriminator/ — tester, Dec 17 '20 at 21:05

Enes · Answer 1 · 2020-12-24T16:47:38.163

If you start with perpect discriminator, loss function will be saturated, and gradient of loss will be very small, so feedback for the generator also will be small, and learning will be slow down as a result. Actually, it is allways desired for discriminator and generator to learn balancedly. Additionally, it is claimed that Wasserstein Loss take care of this problem.

You can find more information in this article. (I strongly suggest to read)

Also from the paper " Towards principled methods for training generative adversarial networks":

In theory, one would expect therefore that we would first train the discriminator as close as we can to optimality (so the cost function on $\theta$ better approximates the JSD), and then do gradient steps on $\theta$, alternating these two things. However, this doesn’t work. In practice, as the discriminator gets better, the updates to the generator get consistently worse. The original GAN paper argued that this issue arose from saturation, and switched to another similar cost function that doesn’t have this problem. However, even with this new cost function, updates tend to get worse and optimization gets massively unstable.

Note 1: $\theta$ is parameters of generator

Note 2: JSD: Jensen–Shannon divergence. For optimal discrimantor, loss is equal to $L(D^{ \ast }, g_{\theta})= 2 JSD(P_{r}, P_{g}) - 2log(2)$

Also from the paper of Wasserstein Loss:

The fact that the EM distance is continuous and differentiable a.e. means that we can (and should) train the critic till optimality. The argument is simple, the more we train the critic, the more reliable gradient of the Wasserstein we get, which is actually useful by the fact that Wasserstein is differentiable almost everywhere. For the JS, as the discriminator gets better the gradients get more reliable but the true gradient is 0 since the JS is locally saturated and we get vanishing gradients ... In Figure 2 we show a proof of concept of this, where we train a GAN discriminator and a WGAN critic till optimality. The discriminator learns very quickly to distinguish between fake and real, and as expected provides no reliable gradient information. The critic, however, can’t saturate, and converges to a linear function that gives remarkably clean gradients everywhere.

Can I start with perfect discriminator in GAN?

1 Answers1