
I've been trying to understand the paper *Distilling the Knowledge in a Neural Network* by Hinton et al., but I cannot fully understand this passage:

> When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases [...]

The information part is very clear, but how does high entropy correlate to less variance between training cases?


2 Answers


Since the teacher is an already-trained network, running an example through it produces gradients that do not have very high variance.

The gradient varies a lot when you are training a network from scratch, but it stops varying much once the network has understood the pattern.
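
As a rough numerical sketch of that last claim (my own illustration, not from the paper): train a tiny logistic-regression "network" on two separable blobs and compare the variance of the per-example gradient before and after training. For a sigmoid with cross-entropy, the per-example gradient with respect to the weights is $(p_i - y_i)\,x_i$, so once the model fits the data ($p_i \approx y_i$) the per-example gradients all shrink toward zero and their variance collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification: two well-separated Gaussian blobs.
n = 1000
x = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_variance(w, b):
    # Per-example gradient of cross-entropy w.r.t. w is (p_i - y_i) * x_i;
    # report the average variance of its components across the dataset.
    p = sigmoid(x @ w + b)
    per_example = (p - y)[:, None] * x
    return per_example.var(axis=0).mean()

# Untrained "network": random weights.
w, b = rng.normal(size=2), 0.0
print("gradient variance before training:", grad_variance(w, b))

# Full-batch gradient descent until the model has learned the pattern.
for _ in range(2000):
    p = sigmoid(x @ w + b)
    w -= 0.1 * ((p - y)[:, None] * x).mean(axis=0)
    b -= 0.1 * (p - y).mean()

print("gradient variance after training: ", grad_variance(w, b))
```

The "after training" variance comes out orders of magnitude smaller, which is the sense in which a trained network's gradients no longer vary much from case to case.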


Suppose you have two tokens that are equally likely. In your original training data, your next-token predictions are

$$[1, 0]\quad\text{and}\quad [0, 1]$$

an equal number of times. The entropy of each of these samples is zero. However, a fully trained network should converge to

$$[0.5, 0.5],$$

which has an entropy of one bit. A second network trained on these higher-entropy samples will see less variance in its KL-loss. Also, because the learned network is essentially collating a whole lot of training samples together, we get more information per sample.
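
Here is a small numerical sketch of this point (my own illustration, using a hypothetical two-token student whose prediction is $[0.5, 0.5]$). For a softmax with either cross-entropy or KL-divergence loss, the gradient with respect to the logits is the predicted distribution minus the target distribution, so hard targets that flip between $[1, 0]$ and $[0, 1]$ produce gradients that swing between $[-0.5, 0.5]$ and $[0.5, -0.5]$, while soft targets near $[0.5, 0.5]$ produce nearly identical gradients on every training case.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy_bits(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=-1)

# Toy 2-token student: the same logits for every case, so any gradient
# variance comes purely from the targets.
n_cases = 10_000
logits = np.zeros((n_cases, 2))          # student predicts [0.5, 0.5]
probs = softmax(logits)

# Hard targets: half the cases are [1, 0], half are [0, 1].
hard = np.zeros((n_cases, 2))
hard[: n_cases // 2, 0] = 1.0
hard[n_cases // 2 :, 1] = 1.0

# Soft targets from a (hypothetical) trained teacher that has learned the
# two tokens are equally likely: every case is close to [0.5, 0.5].
soft = np.full((n_cases, 2), 0.5) + rng.normal(0, 0.01, (n_cases, 2))
soft = soft / soft.sum(axis=-1, keepdims=True)

print("mean entropy, hard targets (bits):", entropy_bits(hard).mean())
print("mean entropy, soft targets (bits):", entropy_bits(soft).mean())

# For a softmax with cross-entropy/KL loss, d(loss)/d(logits) = probs - targets.
grad_hard = probs - hard
grad_soft = probs - soft
print("per-case gradient variance, hard targets:", grad_hard.var(axis=0).mean())
print("per-case gradient variance, soft targets:", grad_soft.var(axis=0).mean())
```

The hard targets have zero entropy and a per-case gradient variance of about 0.25, while the soft targets have roughly one bit of entropy and a gradient variance near zero, matching the quoted claim.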
