
I was pondering the loss function of the GAN, and I arrived at the following:

$$\begin{aligned} L(D, G) & = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))] \\ & = \int_x \bigg( p_{r}(x) \log(D(x)) + p_g (x) \log(1 - D(x)) \bigg) dx \\ & = -\left[CE(p_r(x), D(x)) + CE(p_g(x), 1-D(x)) \right] \end{aligned}$$

where CE stands for cross-entropy. Then, by using the law of large numbers:

$$\begin{aligned} L(D, G) & = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))] \\ & = \lim_{m\to \infty}\frac{1}{m}\sum_{i=1}^{m}\left[1\cdot \log(D(x^{(i)})) + 1\cdot \log(1-D(x^{(i)}))\right] \\ & = - \lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m} \left[CE(1, D(x^{(i)})) + CE(0, D(x^{(i)}))\right] \end{aligned}$$

As you can see, I got a very strange result. Intuitively, this should be wrong, because in the last equation the first term is for real samples and the second term is for generated samples. However, I am curious: where exactly are the mistakes?

(Please explain with math).


2 Answers


I guess the issue is that you lost track of where the samples came from. Since you requested a mathematical explanation, I'll try to go step by step using my own notation, without checking other material, so as not to be biased by how other authors present it.

So we start from

$$ L(D,G) = E_{x \sim p_{r}(x)} \log(D(x)) + E_{x \sim p_{g}(x)}\log(1 - D(x)) $$

then you apply the definition of the expectation operator $E_{\cdot}(\cdot)$ in the continuous case

$$ L(D,G) = \int_{x} \log(D(x)) p_{r}(x)dx + \int_{x}\log(1 - D(x))p_{g}(x)dx $$

then you approximate it by Monte Carlo sampling

$$ L(D,G) = \frac{1}{n} \sum_{i=1}^{n} \log(D(x_{i}^{(r)})) + \frac{1}{m} \sum_{j=1}^{m}\log(1 - D(x_{j}^{(g)})) $$

As you can see, here I have kept the samples from the two distributions separate and used a notation that tracks their origin, so now you can use the right label in the cross-entropy:

$$ L(D,G) = -\left[ \frac{1}{n} \sum_{i=1}^{n} L_{ce}(1, D(x_{i}^{(r)})) + \frac{1}{m} \sum_{j=1}^{m} L_{ce}(0, D(x_{j}^{(g)})) \right] $$
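To make this concrete, here is a minimal numerical sketch of the labelled estimate above. Everything specific in it is an assumption made for illustration only: $p_r$ and $p_g$ are taken to be unit-variance Gaussians centred at $0$ and $2$, and $D$ is fixed to $p_r(x)/(p_r(x)+p_g(x))$ rather than learned.

```python
# Minimal sketch of the labelled Monte Carlo estimate of L(D, G).
# Assumptions (for illustration only): p_r = N(0, 1), p_g = N(2, 1),
# and a fixed discriminator D(x) = p_r(x) / (p_r(x) + p_g(x)).
import numpy as np

rng = np.random.default_rng(0)

def p_r(x):  # density of the "real" distribution (assumed N(0, 1))
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def p_g(x):  # density of the "generated" distribution (assumed N(2, 1))
    return np.exp(-(x - 2)**2 / 2) / np.sqrt(2 * np.pi)

def D(x):    # a fixed, hand-picked discriminator
    return p_r(x) / (p_r(x) + p_g(x))

n, m = 100_000, 100_000
x_real = rng.normal(0.0, 1.0, n)   # x_i^(r) ~ p_r
x_fake = rng.normal(2.0, 1.0, m)   # x_j^(g) ~ p_g

# Monte Carlo estimate with the samples kept separate, as in the answer.
L = np.mean(np.log(D(x_real))) + np.mean(np.log(1 - D(x_fake)))

# The same quantity via per-sample cross-entropy terms:
# L_ce(1, q) = -log(q) for real samples, L_ce(0, q) = -log(1 - q) for fakes,
# so L is minus the average cross-entropy computed with the correct labels.
ce_real = -np.log(D(x_real))       # label 1 for samples from p_r
ce_fake = -np.log(1 - D(x_fake))   # label 0 for samples from p_g
L_via_ce = -(np.mean(ce_real) + np.mean(ce_fake))

print(L, L_via_ce)  # the two estimates coincide
```

The point is simply that each sample carries its label with it: real samples only ever enter the $\log D$ / $L_{ce}(1,\cdot)$ term, generated samples only the $\log(1-D)$ / $L_{ce}(0,\cdot)$ term.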

But you could also have decided to merge the two integrals first, to obtain

$$ L(D,G) = \int_{x} \left( \log(D(x)) p_{r}(x) + \log(1 - D(x))p_{g}(x) \right) dx $$

which is a mathematically legitimate operation; however, the issue arises when you try to discretize it with Monte Carlo sampling.

You can't just replace the integral with a single sum when Monte Carlo sampling it. Here, contrary to what we did above, you do not have one distribution per integral to sample from; instead, the same integral involves two distributions, and for each sample you have to say which distribution it comes from. That is where the issue lies in your notation: you lost track of this information, and it looks as if all the samples come from one distribution.
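One way to see explicitly what the unlabelled sum estimates (a sketch, assuming the pooled samples $x^{(i)}$ are all drawn from some single distribution $p$): by the law of large numbers,

$$ \lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m}\Big[\log D\big(x^{(i)}\big) + \log\big(1 - D\big(x^{(i)}\big)\big)\Big] = \mathbb{E}_{x \sim p(x)}\big[\log D(x) + \log(1 - D(x))\big], $$

which is in general a different functional of $D$ from $L(D,G)$ (the two coincide, for instance, when $p_r = p_g = p$, but not in general).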


$\textbf{Remark.}$ I'd leave this as a comment if I could.

Regarding notation (which I believe may be the cause of your issue here), the loss function is better written as

$$\begin{aligned} \operatorname{Loss} &= \frac{1}{m}\sum_{i=1}^m \left(\log D\big(x^{(i)}\big) + \log\Big(1-D\big(G\big(z^{(i)}\big)\big)\Big)\right)\\ &\approx \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1-D(G(z)))], \end{aligned}$$

where the noise vectors $z$ come from a suitable distribution, and $G(z)$ denotes the output of the generator; the $\approx$ symbol here implicitly assumes that the appropriate form of the Law of Large Numbers (LLN) applies.
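As a small illustration of this notation, here is a sketch in which the real samples $x$ and the noise vectors $z$ are drawn from their own distributions; the particular generator, discriminator, and distributions below are toy assumptions, not anything prescribed by the loss itself.

```python
# Sketch of the finite-sample loss above, with real samples x and noise
# vectors z drawn separately. All concrete choices here (the toy generator,
# discriminator, and distributions) are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(1)
m = 50_000

x = rng.normal(0.0, 1.0, m)   # real samples, x ~ p_r (assumed N(0, 1))
z = rng.normal(0.0, 1.0, m)   # noise vectors, z ~ p_z (assumed N(0, 1))

def G(z):                     # a hypothetical fixed generator
    return 2.0 + z            # maps noise to samples centred around 2

def D(x):                     # a hypothetical fixed discriminator
    return 1.0 / (1.0 + np.exp(x - 1.0))   # sigmoid, larger for "real-looking" x

# The finite-sample loss from the remark: each term sees its own samples.
loss = np.mean(np.log(D(x)) + np.log(1.0 - D(G(z))))
print(loss)   # approximates E_x[log D(x)] + E_z[log(1 - D(G(z)))]
```

Each of the $m$ summands pairs one real sample with one noise vector, but the two never swap roles inside the loss.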

Most importantly, the dependence on $G$ is not trivial (for instance, what if $G$ never learns and always produces the same output?).

Also, the expectations should depend on their respective distributions, even when using LLN. For example, think of how you calculate the expectation of a discrete random variable.
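Concretely, for a discrete random variable taking values $x_k$ with probabilities $p(x_k)$ (a generic example, with $f$ any function of interest),

$$ \mathbb{E}_{x \sim p}[f(x)] = \sum_k f(x_k)\, p(x_k) \approx \frac{1}{m}\sum_{i=1}^{m} f\big(x^{(i)}\big), \qquad x^{(i)} \sim p, $$

and the sample average on the right only estimates this particular expectation because the $x^{(i)}$ are drawn from $p$ itself; samples from a different distribution estimate a different expectation.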