0

In Appendix F of the MuZero paper, the authors explain that they represent values and rewards as vectors.

This means that the neural networks don't output the scalars directly; instead, they output a probability distribution over a fixed set of values, which is later converted back to a scalar.
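To make the conversion concrete, here is a minimal sketch of how I understand the scheme in Appendix F: the scalar is squashed by an invertible transform, spread over the two nearest integer bins of a fixed support, and later recovered as the expected value passed through the inverse transform. The constants (`EPS = 0.001`, a support of integers from -300 to 300) are my reading of the paper and should be treated as assumptions; the function names are hypothetical and not taken from any MuZero code.

```python
import math

EPS = 0.001         # assumed weight of the small linear term in the transform
SUPPORT_MIN = -300  # assumed lower bound of the integer support
SUPPORT_MAX = 300   # assumed upper bound of the integer support


def h(x):
    """Invertible squashing transform applied before discretization."""
    return math.copysign(1.0, x) * (math.sqrt(abs(x) + 1.0) - 1.0) + EPS * x


def h_inv(y):
    """Closed-form inverse of h, obtained by solving a quadratic in sqrt(|x| + 1)."""
    root = (math.sqrt(1.0 + 4.0 * EPS * (abs(y) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)
    return math.copysign(1.0, y) * (root * root - 1.0)


def scalar_to_two_hot(x):
    """Spread the transformed scalar over its two neighbouring integer bins."""
    y = min(max(h(x), SUPPORT_MIN), SUPPORT_MAX)  # clamp to the support
    low = math.floor(y)
    frac = y - low
    probs = [0.0] * (SUPPORT_MAX - SUPPORT_MIN + 1)
    probs[low - SUPPORT_MIN] = 1.0 - frac
    if frac > 0.0:  # guard so the top bin never indexes past the end
        probs[low - SUPPORT_MIN + 1] = frac
    return probs


def two_hot_to_scalar(probs):
    """Take the expected bin value, then undo the squashing transform."""
    expected = sum(p * (SUPPORT_MIN + i) for i, p in enumerate(probs))
    return h_inv(expected)


if __name__ == "__main__":
    for value in (0.0, 3.7, -1000.0):
        print(value, two_hot_to_scalar(scalar_to_two_hot(value)))
```

During training the network's softmax output would be compared against such a two-hot target with a cross-entropy loss; at inference time only `two_hot_to_scalar` is needed. Under these assumed constants, the support covers scalars up to a magnitude of somewhat under 60000.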

I wonder why it's done this way. Let's say they want to support a reward/value range of [-60000, 60000]. They could have the network output a scalar y and then do tanh(y)*60000 or even output the actual reward or value directly.

What's the advantage of representing scalars as vectors?

Lynix
  • 33
  • 3

1 Answer

2

By representing the rewards and values as vectors, the network is able to model uncertainty. Instead of committing to one specific reward, it can assign a non-zero probability to each of several possible rewards.
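As a toy illustration of that point (not MuZero's actual support or transform), the sketch below builds two distributions over a small hypothetical support of rewards 0 to 10. Both decode to the same scalar, so a scalar output could not tell them apart, but the vector form keeps the difference between "always 5" and "either 0 or 10".

```python
import math

SUPPORT = list(range(11))  # hypothetical toy support: rewards 0, 1, ..., 10


def expected_value(probs):
    """Scalar that the distribution would be decoded to."""
    return sum(p * v for p, v in zip(probs, SUPPORT))


def entropy(probs):
    """Shannon entropy in nats; 0 means the prediction is fully certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# All probability mass on reward 5: a confident prediction.
certain = [1.0 if v == 5 else 0.0 for v in SUPPORT]

# Half the mass on reward 0 and half on reward 10: an uncertain prediction.
bimodal = [0.5 if v in (0, 10) else 0.0 for v in SUPPORT]

if __name__ == "__main__":
    for name, probs in (("certain", certain), ("bimodal", bimodal)):
        print(name, expected_value(probs), entropy(probs))
```

Both distributions have an expected value of 5, but only the second has non-zero entropy, which is the extra information a scalar output would discard.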

This tends to stabilize the training process, since each parameter update is based on a loss over the whole distribution rather than on a single scalar error.

Lynix