Representing scalars as vectors for the network output

Question

In Appendix F of the MuZero paper, the authors explain that they represent values and rewards as vectors.

This means that the neural networks don't output the scalars directly. Instead, they output a probability distribution that later gets converted back to a scalar.

I wonder why it's done this way. Let's say they want to support a reward/value range of [-60000, 60000]. They could have the network output a scalar y and then do tanh(y)*60000 or even output the actual reward or value directly.

What's the advantage of representing scalars as vectors?

score 2 · Accepted Answer · answered Aug 30 '24 at 10:23

By representing the rewards and values as vectors, the network is able to model uncertainty. Instead of committing to one specific reward, it can assign a non-zero probability to several possible rewards.

This tends to stabilize the training process, since each parameter update is based on the whole predicted distribution rather than on a single scalar error.
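
To make the idea concrete, here is a minimal sketch in plain Python of how a scalar can be spread over discrete bins and recovered again. It is not the paper's implementation: the small support range and every name in it (`SUPPORT_MAX`, `scalar_to_two_hot`, `distribution_to_scalar`, `cross_entropy`) are hypothetical choices for illustration, and the invertible scaling transform that the paper applies before this step is left out.

```python
import math

# Assumption: a tiny integer support from -5 to 5, chosen only for illustration.
SUPPORT_MAX = 5
SUPPORT = list(range(-SUPPORT_MAX, SUPPORT_MAX + 1))


def scalar_to_two_hot(x):
    """Encode a scalar as weights on its two neighbouring integer bins."""
    x = max(-SUPPORT_MAX, min(SUPPORT_MAX, x))  # clamp into the support range
    low = math.floor(x)
    high_weight = x - low  # fraction of the mass that goes to the upper bin
    probs = [0.0] * len(SUPPORT)
    probs[low + SUPPORT_MAX] = 1.0 - high_weight
    if high_weight > 0.0:
        probs[low + 1 + SUPPORT_MAX] = high_weight
    return probs


def distribution_to_scalar(probs):
    """Decode a distribution over the bins as its expected value."""
    return sum(p * s for p, s in zip(probs, SUPPORT))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and the network's logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0.0)


target = scalar_to_two_hot(3.7)          # mass 0.3 on bin 3, mass 0.7 on bin 4
recovered = distribution_to_scalar(target)

# A bimodal prediction: "the reward is either -4 or +4, equally likely".
bimodal = [0.0] * len(SUPPORT)
bimodal[-4 + SUPPORT_MAX] = 0.5
bimodal[4 + SUPPORT_MAX] = 0.5
```

Here `recovered` comes back as 3.7 up to floating-point error. The `bimodal` vector decodes to an expected value of 0, yet it still records that the outcome is either -4 or +4, which a single scalar output could not express. During training, a loss such as `cross_entropy(target, logits)` compares the encoded target against every bin of the prediction at once.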
