
I've been trying to understand the paper *Distilling the Knowledge in a Neural Network* by Hinton et al., but I cannot fully understand this passage:

> When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases [...]

The information part is very clear, but how does high entropy correlate to less variance between training cases?


2 Answers


Since the teacher is an already-trained network, running an example through it produces gradients that do not have very high variance.

The gradient varies a lot when you are training a network from scratch, but it stops varying much once the network has understood the pattern.
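
As a rough numerical sketch of that last claim (my own illustration, not from the paper): train a tiny logistic-regression "network" on two separable blobs and compare the variance of the per-example gradient before and after training. For a sigmoid with cross-entropy, the per-example gradient with respect to the weights is $(p_i - y_i)\,x_i$, so once the model fits the data ($p_i \approx y_i$) the per-example gradients all shrink toward zero and their variance collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification: two well-separated Gaussian blobs.
n = 1000
x = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_variance(w, b):
    # Per-example gradient of cross-entropy w.r.t. w is (p_i - y_i) * x_i;
    # report the average variance of its components across the dataset.
    p = sigmoid(x @ w + b)
    per_example = (p - y)[:, None] * x
    return per_example.var(axis=0).mean()

# Untrained "network": random weights.
w, b = rng.normal(size=2), 0.0
print("gradient variance before training:", grad_variance(w, b))

# Full-batch gradient descent until the model has learned the pattern.
for _ in range(2000):
    p = sigmoid(x @ w + b)
    w -= 0.1 * ((p - y)[:, None] * x).mean(axis=0)
    b -= 0.1 * (p - y).mean()

print("gradient variance after training: ", grad_variance(w, b))
```

The "after training" variance comes out orders of magnitude smaller, which is the sense in which a trained network's gradients no longer vary much from case to case.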


Suppose you have two tokens that are equally likely. In your original training data, your next-token predictions are

$$[1, 0]\quad\text{and}\quad [0, 1]$$

an equal number of times. The entropy of each of these samples is zero. However, a fully trained network should converge to

$$[0.5, 0.5],$$

which has an entropy of one bit. A second network trained on these higher-entropy samples will see less variance in its KL-loss. Also, because the learned network is essentially collating a whole lot of training samples together, we get more information per sample.
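
Here is a small numerical sketch of this point (my own illustration, using a hypothetical two-token student whose prediction is $[0.5, 0.5]$). For a softmax with either cross-entropy or KL-divergence loss, the gradient with respect to the logits is the predicted distribution minus the target distribution, so hard targets that flip between $[1, 0]$ and $[0, 1]$ produce gradients that swing between $[-0.5, 0.5]$ and $[0.5, -0.5]$, while soft targets near $[0.5, 0.5]$ produce nearly identical gradients on every training case.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy_bits(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=-1)

# Toy 2-token student: the same logits for every case, so any gradient
# variance comes purely from the targets.
n_cases = 10_000
logits = np.zeros((n_cases, 2))          # student predicts [0.5, 0.5]
probs = softmax(logits)

# Hard targets: half the cases are [1, 0], half are [0, 1].
hard = np.zeros((n_cases, 2))
hard[: n_cases // 2, 0] = 1.0
hard[n_cases // 2 :, 1] = 1.0

# Soft targets from a (hypothetical) trained teacher that has learned the
# two tokens are equally likely: every case is close to [0.5, 0.5].
soft = np.full((n_cases, 2), 0.5) + rng.normal(0, 0.01, (n_cases, 2))
soft = soft / soft.sum(axis=-1, keepdims=True)

print("mean entropy, hard targets (bits):", entropy_bits(hard).mean())
print("mean entropy, soft targets (bits):", entropy_bits(soft).mean())

# For a softmax with cross-entropy/KL loss, d(loss)/d(logits) = probs - targets.
grad_hard = probs - hard
grad_soft = probs - soft
print("per-case gradient variance, hard targets:", grad_hard.var(axis=0).mean())
print("per-case gradient variance, soft targets:", grad_soft.var(axis=0).mean())
```

The hard targets have zero entropy and a per-case gradient variance of about 0.25, while the soft targets have roughly one bit of entropy and a gradient variance near zero, matching the quoted claim.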
