
I want to create an AI which can play five-in-a-row/Gomoku. I want to use reinforcement learning for this.

I use the policy gradient method, namely REINFORCE, with a baseline. For the value and policy function approximation, I use a neural network with convolutional and fully connected layers. All of the layers, except for the output layers, are shared. The policy's output layer has $8 \times 8 = 64$ output units (the size of the board) with a softmax over them, so the policy is stochastic.
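
For illustration, a minimal sketch of this kind of shared network in PyTorch (the exact layer sizes and channel counts here are placeholders, not necessarily what I use):

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    # Shared convolutional/fully-connected trunk with separate policy and value heads.
    def __init__(self, board_size=8, channels=32):
        super().__init__()
        self.trunk = nn.Sequential(                       # layers shared by both outputs
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(channels * board_size * board_size, 128), nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, board_size * board_size)  # 64 logits, one per square
        self.value_head = nn.Linear(128, 1)                         # state-value baseline

    def forward(self, board):                             # board: (batch, 1, 8, 8)
        h = self.trunk(board)
        policy = torch.softmax(self.policy_head(h), dim=-1)  # stochastic policy over the 64 squares
        value = self.value_head(h)
        return policy, value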

But what if the network produces a very high probability for an invalid action? An invalid action is when the agent tries to place its mark on a square that already contains an X or an O. I think the agent could get stuck in that game state.

How should I handle this situation?

My guess is to use the actor-critic method. For an invalid action, we should give a negative reward and pass the turn to the opponent.

Molnár István

5 Answers


Just ignore the invalid moves.

For exploration, it is likely that you won't just execute the move with the highest probability, but will instead choose moves randomly based on the output probabilities. If you only punish illegal moves, they will still retain some probability (however small) and will therefore be executed from time to time (however seldom). So you will always end up with an agent which occasionally makes illegal moves.

To me, it makes more sense to just set the probabilities of all illegal moves to zero and renormalise the output vector before you choose your move.
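
A minimal sketch of that zero-and-renormalise step in Python/NumPy, assuming the policy output and the board are flat arrays of 64 entries (the names here are just illustrative):

import numpy as np

def sample_legal_move(action_probs, board):
    # action_probs: (64,) softmax output of the policy network
    # board: (64,) array where 0 = empty square, non-zero = occupied by X or O
    legal_mask = (board == 0).astype(float)        # 1 for empty squares, 0 for occupied ones
    masked_probs = action_probs * legal_mask       # set probabilities of illegal moves to zero
    masked_probs /= masked_probs.sum()             # renormalise so the probabilities sum to 1
    return np.random.choice(len(masked_probs), p=masked_probs)  # sample a legal square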

BlindKungFuMaster

Softmax policies in policy gradient methods with linear function approximation usually use the following formula to calculate the probability of choosing action $a$. Here, $\theta$ are the weights, and the features $\phi$ are a function of the current state $s$ and an action from the set of actions $A$.

$$ \pi(\theta, a) = \frac{e^{\theta \phi(s, a)}}{\sum_{b \in A} e^{\theta \phi(s, b)}} $$

To eliminate illegal moves, one would limit the set of actions to only those that are legal, hence $Legal(A)$.

$$ \pi(\theta, a) = \frac{e^{\theta \phi(s, a)}}{\sum_{b \in Legal(A)} e^{\theta \phi(s, b)}}, \, a \in Legal(A) $$

In pseudocode, the formula may look like this:

action_probs = Agent.getActionProbs(state)                     # raw scores/probabilities for every action
legal_action_probs = filterLegalActions(state, action_probs)   # keep only the legal actions
legal_action_dist = softmax(legal_action_probs)                # renormalised distribution over legal actions
action = sampleFrom(legal_action_dist)                         # sample the move to play

Whether you use linear or non-linear function approximation (your neural network), the idea is the same: only include the legal moves when computing your softmax. This guarantees that the agent only produces valid moves, which is good if you want to change your game later on, and it makes the differences in value between the remaining actions easier for the agent to discriminate. It will also be faster, because the number of possible actions decreases.
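
A minimal sketch of this restricted softmax applied to a network's output logits (the function and variable names are illustrative):

import numpy as np

def legal_softmax(logits, legal_actions):
    # logits: (num_actions,) raw scores from the policy network
    # legal_actions: indices of the moves that are legal in the current state
    legal_logits = logits[legal_actions]
    exps = np.exp(legal_logits - legal_logits.max())   # subtract the max for numerical stability
    probs = exps / exps.sum()                          # softmax over the legal actions only
    return probs                                       # probs[k] corresponds to legal_actions[k]

# Example: sample a move on an 8x8 board where squares 0 and 9 are occupied
logits = np.random.randn(64)
legal_actions = np.array([i for i in range(64) if i not in (0, 9)])
probs = legal_softmax(logits, legal_actions)
move = np.random.choice(legal_actions, p=probs)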

Jaden Travnik

I faced a similar issue recently with Minesweeper.

The way I solved it was by ignoring the illegal/invalid moves entirely.

  1. Use the Q-network to predict the Q-values for all of your actions (valid and invalid)
  2. Pre-process the Q-values by setting all of the invalid moves to a Q-value of zero or a large negative number, depending on your scenario (see the sketch after this list)
  3. Use a policy of your choice to select an action from the refined Q-values (i.e. greedy or Boltzmann)
  4. Execute the selected action and resume your DQN logic
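
A minimal sketch of steps 1-3 in Python/NumPy, assuming q_network is a callable that returns one Q-value per action; here the invalid moves get -inf rather than zero so a greedy policy can never pick them (the names are illustrative):

import numpy as np

def select_action(q_network, state, valid_mask, epsilon=0.05):
    # valid_mask: boolean array, True where the move is valid
    q_values = q_network(state)                          # step 1: Q-values for all actions
    q_values = np.where(valid_mask, q_values, -np.inf)   # step 2: mask out the invalid moves
    if np.random.rand() < epsilon:                       # step 3: epsilon-greedy over valid moves
        return int(np.random.choice(np.flatnonzero(valid_mask)))
    return int(np.argmax(q_values))                      # greedy pick among the valid moves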

Hope this helps.

Sanavesa

IMHO, the idea of invalid moves is itself invalid. Imagine placing an "X" at coordinates (9, 9), i.e. outside the $8 \times 8$ board. You could consider it to be an invalid move and give it a negative reward. Absurd? Sure!

But in fact your invalid moves are just a relic of the representation (which itself is straightforward and fine). The best treatment of them is to exclude them completely from any computation.

This gets more apparent in chess:

  • In a positional representation, you might consider the move a1-a8, which only belongs in the game if there's a Rook or a Queen at a1 (and some other conditions hold).

  • In a different representation, you might consider the move Qb2. Again, this may or may not belong to the game. When the current player has no Queen, then it surely does not.

As the invalid moves are related to the representation rather than to the game, they should not be considered at all.

maaartinus

There is an experimental paper on arXiv about the effect of masking invalid actions versus giving them negative rewards. The paper also contains some references that discuss the effects of, and the mechanisms for, handling invalid actions. However, those main references are still only pre-prints on arXiv (not published and presumably not peer-reviewed yet).

As for handling that situation, other answers have already given good practical methods for ignoring the invalid actions. I just want to add one more trick: you can pre-compute a binary vector as the mask for the actions and add the log of the mask to the logits before the softmax operation. The log of 0 is -inf and exp(-inf) is 0 in PyTorch (I don't know if the same applies in TensorFlow).

$$ P(a_i \mid A'_t, X_t) = \text{softmax}(u_i + \log m^t_i), \quad \forall i \in \{1, \dots, |A|\}, \; m^t_i \in \{0, 1\} $$

where $P(a_i \mid A'_t, X_t)$ is the probability of taking action $a_i$ given the action history $A'_t$ up to the $t$-th step and the current environment's features $X_t$, $u$ is the output of the last layer of the model, $A$ is the action space, and $m^t_i$ is the feasibility mask of action $a_i$ for the current step. Therefore, the probability of an invalid action is 0 after the softmax operation. This way, you can treat the mask as part of the state and feed it to your model as input. This is especially handy for algorithms that employ an experience replay memory, because the mask can then be saved in the memory too.
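
A minimal sketch of this log-mask trick in PyTorch (the function and tensor names are illustrative):

import torch
import torch.nn.functional as F

def masked_action_probs(logits, valid_mask):
    # logits: (batch, num_actions) output u of the last layer
    # valid_mask: (batch, num_actions) binary tensor m, 1 = feasible, 0 = infeasible
    # log(0) = -inf and exp(-inf) = 0 in PyTorch, so infeasible actions end up with probability 0
    return F.softmax(logits + torch.log(valid_mask), dim=-1)

# Example: a single 8x8 board state where squares 0 and 9 are already occupied
logits = torch.randn(1, 64)
mask = torch.ones(1, 64)
mask[0, 0] = 0.0
mask[0, 9] = 0.0
probs = masked_action_probs(logits, mask)
print(probs[0, 0].item(), probs[0, 9].item())   # both are 0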

Sanyou