
The WGAN paper concretely proposes Algorithm 1 (cf. page 8) and also states the losses for the critic and the generator.

When implementing the critic loss (lines 5 and 6 of Algorithm 1), they maximize the objective with respect to the parameters $w$ (instead of minimizing, as one would normally do), using the ascent update $w \leftarrow w + \alpha \cdot \text{RMSProp}\left(w, g_w \right)$. The objective being maximized is $$\frac{1}{m}\sum_{i = 1}^{m}f_{w}\left(x^{\left(i\right)} \right) - \frac{1}{m}\sum_{i = 1}^{m} f_{w}\left( g_{\theta}\left( z^{\left( i\right)}\right)\right).\quad \quad (1)$$

The function $f$ is the critic, i.e. a neural network, and this loss is implemented in PyTorch in this YouTube video (cf. minutes 11:00 to 12:26) as follows:

critic_real = critic(real_images)
critic_fake = critic(generator(noise))
loss_critic = -(torch.mean(critic_real) - torch.mean(critic_fake))
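The sign flip works because maximizing Eq. (1) over $w$ is the same as minimizing its negation, which is what standard optimizers do. A minimal runnable sketch with toy tensors standing in for the critic outputs (the values are made up for illustration):

```python
import torch

# Toy critic scores standing in for f_w(x^(i)) and f_w(g_theta(z^(i))), m = 4
critic_real = torch.tensor([0.9, 1.1, 0.8, 1.2])
critic_fake = torch.tensor([0.2, 0.1, 0.4, 0.3])

# Eq. (1): the objective the critic ASCENDS on
objective = critic_real.mean() - critic_fake.mean()  # 1.0 - 0.25 = 0.75

# The video's loss is the negation, so a descent step on it is
# exactly an ascent step on Eq. (1)
loss_critic = -(torch.mean(critic_real) - torch.mean(critic_fake))

assert torch.isclose(loss_critic, -objective)
```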

My question is: in my own experiments on the CelebA dataset, I found that the critic loss is negative, and that image quality is better when this negative critic loss is higher (closer to zero). A critic loss of $-0.75$, for example, produced better generated images than a critic loss of $-1.26$.

Is there perhaps an error in the video's implementation of Eq. (1) and Algorithm 1 of the WGAN paper? In my opinion the implementation is correct, but then I am still confused about why I get better images when the loss is higher ...

Cheers!

nbro

1 Answer


I know that this is an old question, but here are my two cents.

  1. Yes, it is possible for the Wasserstein loss to go negative.
  2. Yes, the implementation seems correct. However, it also depends on the implementation of the generator loss: you can invert the signs in both losses and it will work just as well. The key is to ensure that the generator and the critic compete against one another, not with one another.
  3. Interpreting the Wasserstein loss is a complex undertaking, and you should rarely draw conclusions based solely on whether the loss is going up or down. I have seen WGANs converge with all kinds of negative and positive critic losses. There are resources that offer tips on interpreting the loss, but its convergence depends heavily on factors such as the dataset, the type and size of the architecture, the learning rate, and many more. You should assume that what works in the video on one dataset does not translate into meaningful insights for training on another dataset with a different architecture.
  4. If you have exactly the same implementation and dataset, you should see approximately the same behaviour. Note, however, that library updates and differences across libraries can introduce subtle changes, e.g. in weight initialization or in the default parameters of commonly used functions.
  5. Abstain from comparing different runs based on the trajectory of the loss; use a separate metric (such as FID) for that.
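To make point 2 concrete, here is a minimal sketch of one WGAN training iteration in the spirit of Algorithm 1 (critic ascent with weight clipping, then a generator step with the opposing sign). The linear networks, random data, and names like `opt_c` are toy stand-ins of my own, not the video's DCGAN setup:

```python
import torch
import torch.nn as nn

# Toy shapes; the real models would be conv nets on images.
z_dim, img_dim, batch = 8, 16, 32
critic = nn.Sequential(nn.Linear(img_dim, 32), nn.ReLU(), nn.Linear(32, 1))
generator = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, img_dim))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

real = torch.randn(batch, img_dim)  # stand-in for a batch of real data

# Critic steps (lines 2-8 of Algorithm 1), run n_critic times per generator step.
# Minimizing loss_c ascends on Eq. (1).
for _ in range(5):
    noise = torch.randn(batch, z_dim)
    loss_c = -(critic(real).mean() - critic(generator(noise)).mean())
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    # Weight clipping keeps f_w (roughly) Lipschitz.
    for p in critic.parameters():
        p.data.clamp_(-0.01, 0.01)

# Generator step (lines 9-11): the opposing sign is what makes the two
# networks compete; flipping the sign in BOTH losses would also work.
noise = torch.randn(batch, z_dim)
loss_g = -critic(generator(noise)).mean()
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

Note that `-loss_c` is the critic's estimate of the Wasserstein distance; it can be logged as a diagnostic, but runs should still be compared with a separate metric.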
Robin van Hoorn