
It's mentioned here that there is no measure of intra-class diversity with the inception score:

If your generator generates only one image per classifier image class, repeating each image many times, it can score highly (i.e. there is no measure of intra-class diversity)

However, isn't it "easy" to look at the variance of the outputs of the classifier for a given class? For example, if the classifier outputs 0.97 for all the images of a given GAN class, there is no intra-class diversity, but if it outputs 0.97, 0.95, 0.99, 0.92, there is some diversity. I'm struggling to understand why this is hard to do (but I might be missing something!).
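Concretely, the check I have in mind would be something like this (the probability values are the made-up ones from above):

```python
import numpy as np

# Hypothetical winning-class softmax outputs of the classifier,
# one value per generated image of the same class.
probs = np.array([0.97, 0.95, 0.99, 0.92])

# Proposed diversity proxy: variance of the winning-class probability.
# Zero variance would suggest the generator always produces the "same" image.
print(probs.var())  # 0.00066875
```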

nbro

2 Answers


For reference, a recap of the Inception Score: an Inception classifier is applied to the generated images. The score is high when each generated image receives a confident (low-entropy) class prediction while, across the whole set of generated images, the marginal class distribution has high entropy (i.e. many classes are covered); otherwise it is low.
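A minimal numpy sketch of that definition (IS = exp of the average KL divergence between each image's class distribution p(y|x) and the marginal p(y); the toy inputs are made up):

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp(mean_x KL(p(y|x) || p(y))); p_yx has shape (n_images, n_classes)."""
    p_y = p_yx.mean(axis=0)  # marginal class distribution p(y)
    kl = (p_yx * (np.log(p_yx) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread over 3 classes -> score approaches n_classes = 3.
sharp = np.eye(3) * 0.98 + 0.01
sharp /= sharp.sum(axis=1, keepdims=True)
print(inception_score(sharp))  # ~2.68

# Mode collapse: every image gets the same prediction -> score = 1.
collapsed = np.tile(sharp[0], (3, 1))
print(inception_score(collapsed))  # 1.0
```

Note that nothing in this computation sees the image pixels themselves, only the class probabilities, which is why intra-class variation is invisible to it.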

However, isn't it "easy" to look at the variance of the outputs of the classifier for a given class

Say you want to generate multiple horses, and the model learns to generate horses with different colors but always in the same pose: your class probabilities will vary, but I wouldn't call this very diverse horse generation. This is how I understand the statement you cited.

The output distributions from the inception model contain class information but very little information of specific image features. Thus, the inception score cannot be sensitive to intra-class variations of the generator.

Chillston

Adding on top of Chillston's answer:

Regarding the variance, it is unfortunately not so straightforward. The problem is that most deep learning models are not calibrated, so small intra-class variations might lead to large variations in the probability of the winning class. One way to account for this issue might be to compute the mutual information between the generated predictions and an expected prior distribution, for example a uniform distribution 1/n over the n expected modes within a class (horse poses, to use the same example as Chillston). However, I found no reference to similar attempts, and coming up with a proper prior distribution doesn't sound trivial at all. I guess the reason is that the inception score was designed for generic GANs (i.e. GANs trained to generate generic classes from CIFAR and similar datasets), without the aim of measuring variability within classes.
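As a toy sketch of this idea (the pose categories and counts are entirely hypothetical, and I use a KL divergence to the uniform prior as a simple stand-in for a full mutual-information estimate):

```python
import numpy as np

# Hypothetical: we expect n = 4 horse poses, each equally likely a priori.
n = 4
prior = np.full(n, 1 / n)

# Hypothetical empirical pose counts among the generated horses.
counts = np.array([70, 20, 5, 5])
empirical = counts / counts.sum()

# KL(empirical || prior): 0 would mean the generator covers the poses
# uniformly; larger values indicate a collapse onto a few modes.
kl = (empirical * np.log(empirical / prior)).sum()
print(kl)  # ~0.515
```

The hard part, as said above, is that in practice you would also need a model that labels the "pose" (or whatever within-class factor you care about) of each generated image, plus a defensible prior over those factors.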

Edoardo Guerriero