How to deal with invalid output in a policy network?

Question

I am interested in creating a neural network-based chess engine. It uses an $8 \times 8 \times 73$ output space to encode every possible move, as proposed in the AlphaZero paper, "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm".

However, when running the network, the first selected move is invalid. How should we deal with this? Basically, I see two options.

1. Pick the next highest-scoring move until a legal one is found. In this case, the network might learn over time not to rank illegal moves highest.
2. Count the game as a loss for the player who picked the illegal move. This might have the disadvantage that the network gets 'stuck' on only a few legal moves.

What is the preferred solution to this particular problem?

score 0 · Answer 1 · edited Jul 10 '19 at 20:46


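You should have a method that generates the set of legal moves for the current board state. Use this as a mask on the policy head's outputs before normalization (the softmax), so that illegal moves receive zero probability and the remaining probability mass is distributed over legal moves only.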
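As a rough illustration of the masking idea, here is a minimal sketch in plain Python. The function name `masked_policy`, the flat list layout, and the toy 5-slot output are assumptions made for the example; in a real engine the logits would be the flattened $8 \times 8 \times 73$ output and the mask would come from your own move generator.

```python
import math


def masked_policy(logits, legal_mask):
    """Softmax restricted to legal moves (hypothetical helper).

    logits: flat list of raw policy-head outputs, one per move slot.
    legal_mask: list of booleans of the same length, True where the move is legal.
    """
    if len(logits) != len(legal_mask):
        raise ValueError("logits and legal_mask must have the same length")

    legal_logits = [x for x, ok in zip(logits, legal_mask) if ok]
    if not legal_logits:
        raise ValueError("no legal moves in this position")

    # Subtract the largest legal logit for numerical stability.
    m = max(legal_logits)

    # Illegal slots contribute exactly zero to the distribution.
    exps = [math.exp(x - m) if ok else 0.0 for x, ok in zip(logits, legal_mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with 5 move slots instead of 8 * 8 * 73 = 4672.
logits = [2.0, 5.0, 1.0, 0.5, 3.0]
legal_mask = [True, False, True, False, True]  # slot 1 scores highest but is illegal

probs = masked_policy(logits, legal_mask)
best_move = max(range(len(probs)), key=probs.__getitem__)
```

Here the highest raw output (slot 1) is illegal, so it ends up with probability zero and `best_move` is the highest-scoring legal slot. Taking the argmax of the masked distribution therefore selects the same move as the "pick the next highest legal move" option from the question, without ever having to treat an illegal move as a loss.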

edited Jul 10 '19 at 20:46

nbro

answered Jun 10 '19 at 12:27

mshlis
