As the gradient is back-propagated from the output layer (the last layer) to the input layer (the first layer), passing through multiple hidden layers via the chain rule, it is intuitive that the gradient of an earlier layer depends on the gradient of the layer after it.
There are two possibilities for how the gradient can change along the way: a vanishing gradient, where the gradient gradually becomes smaller (which leads to a small weight update), and an exploding gradient, where the gradient gradually becomes larger (which leads to a large weight update).
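To make the chain-rule intuition concrete, here is a toy sketch (the layer count and the per-layer scale factors are made up for illustration, not taken from any real network): repeatedly multiplying the output gradient by a factor below 1 shrinks it toward zero, while a factor above 1 blows it up.

n_layers = 50
for scale in (0.5, 1.5):        # 0.5 -> vanishing, 1.5 -> exploding
    g = 1.0                     # gradient at the output layer
    for _ in range(n_layers):
        g *= scale              # stand-in for one chain-rule factor per layer
    print(f"scale={scale}: gradient reaching the first layer ~ {g:.3e}")

Resulting in roughly:

scale=0.5: gradient reaching the first layer ~ 8.882e-16
scale=1.5: gradient reaching the first layer ~ 6.376e+08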
Practically, a gradient (which is just a number) is represented in IEEE 754 floating point; the precision depends on how many bits the format provides, e.g., float64. If the gradient of an architecture theoretically stays within the range -5 to 5, then some bit patterns are wasted, e.g., on the order of 2^15 possible values that the gradient never reaches.
So, if the gradient becomes smaller and smaller, like 10^-99, then a single-bit change produces a slightly different gradient value, which disturbs the overall computation.
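As a quick sanity check (a small numpy sketch; the value 10^-99 is just for illustration), casting such a tiny gradient into lower-precision dtypes shows it simply flushing to zero:

import numpy as np

tiny_grad = 1e-99                      # a "vanished" gradient value
for dtype in (np.float64, np.float32, np.float16):
    print(dtype.__name__, np.array(tiny_grad, dtype=dtype))

Resulting in:

float64 1e-99
float32 0.0
float16 0.0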
But if we instead had a hardware constraint, based on domain-specific knowledge, that the minimum bit pattern 000...000 is interpreted as -5 and the maximum bit pattern 111...111 as 5, then the vanishing gradient would never exist, since all bits are used efficiently.
In general, if there were infinitely many bits carrying information, there would never be such a thing as a vanishing gradient.
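A minimal sketch of that fixed-point idea (the 16-bit width and the [-5, 5] range are assumptions for illustration, not a real hardware spec): map the known gradient range uniformly onto every integer bit pattern, so no patterns are spent on values the gradient can never take.

import numpy as np

BITS = 16                          # assumed word size
LO, HI = -5.0, 5.0                 # assumed gradient range from domain knowledge
LEVELS = 2**BITS - 1               # bit pattern 000...0 maps to LO, 111...1 to HI

def encode(g):
    # quantize a gradient in [LO, HI] to an unsigned integer code
    g = np.clip(g, LO, HI)
    return np.round((g - LO) / (HI - LO) * LEVELS).astype(np.uint16)

def decode(code):
    # map the integer code back to a gradient value
    return LO + code.astype(np.float64) / LEVELS * (HI - LO)

g = np.array([-4.2, 0.001, 3.7])
print(decode(encode(g)))           # each value rounded to the nearest of 2^16 levels

The resolution is uniform here: (HI - LO) / LEVELS, about 1.5 * 10^-4 per step, and every one of the 2^16 bit patterns maps to a value inside the useful range.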
I tried this with numpy, and it shows that $-\ln(\epsilon) \neq \infty$; it is limited to a relatively small number.
import numpy as np

print(-np.log(np.finfo(np.float16).eps))
print(-np.log(np.finfo(np.float32).eps))
print(-np.log(np.finfo(np.float64).eps))
Resulting in:
6.93
15.942385
36.04365338911715
For $\log_2$, we get pretty round numbers:
print(-np.log2(np.finfo(np.float16).eps))
print(-np.log2(np.finfo(np.float32).eps))
print(-np.log2(np.finfo(np.float64).eps))
10.0
23.0
52.0
Those round numbers come from the mantissa bit sizes of IEEE 754: 10 bits for float16, 23 for float32, and 52 for float64, since $\epsilon$ here is exactly $2^{-\text{mantissa bits}}$.
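This can be verified with np.finfo, which exposes the stored mantissa size as nmant:

import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # eps is the gap between 1.0 and the next representable value: 2**-nmant
    print(dtype.__name__, info.nmant, info.eps == 2.0**-info.nmant)

Resulting in:

float16 10 True
float32 23 True
float64 52 True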
This is critical for loss functions that are computed on a logarithmic scale, such as categorical crossentropy.
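This is exactly where the $-\ln(\epsilon)$ values above show up. A minimal sketch of eps clipping (my own toy implementation, not any particular framework's): the predicted probabilities are clipped to $[\epsilon, 1 - \epsilon]$ before the logarithm, so the loss saturates at $-\ln(\epsilon)$ instead of blowing up to infinity.

import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=np.finfo(np.float32).eps):
    # clip so that np.log never sees an exact zero
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.3, 0.0, 0.7]])   # the true class was predicted with probability 0

print(categorical_crossentropy(y_true, y_pred))

Resulting in:

[15.94238515]

which is the float32 $-\ln(\epsilon)$ limit computed above.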