
I'm reading about Invalid Action Masking in RL in order to use it in my PPO algorithm for a specific task. The problem is that in the explanations I have read (here, here and here), the invalid action is chosen based on the state of the environment. My question is: is it possible to determine the invalidity of an action based on the reward?

I want to make this clear with a simple example. Let's take the good old CartPole environment from OpenAI:

[Image: the CartPole environment]

And let's increase the action space from two to three, so there are basically three discrete actions: $a_t \in \{0, 1, 2\}$, where 0 means left, 1 means right, and 2 means do nothing.

How do I determine which action, based on the state, should be masked out?
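For concreteness, something like the sketch below is what I imagine a state-based mask could look like. It is purely made up for illustration: CartPole has no rule that actually forbids any of these actions, and the threshold is an arbitrary value I picked myself.

```python
import numpy as np

def action_mask(observation: np.ndarray) -> np.ndarray:
    """Return a boolean mask over the 3 actions [left, right, do nothing].

    Hypothetical rule, invented only to illustrate state-based masking:
    once the pole leans too far, forbid 'do nothing' (action 2).
    """
    pole_angle = observation[2]          # CartPole observation: [x, x_dot, theta, theta_dot]
    mask = np.array([True, True, True])
    if abs(pole_angle) > 0.1:            # arbitrary threshold, not part of the real environment
        mask[2] = False
    return mask
```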

#### UPDATE ####

Thanks to @Neil-Slater for answering the question. Reading his answer I understood that my question was misleading and didn't get to the point. The example I provided above is too simplified and doesn't exactly reflect my question. So I'll try again, this time using the picture I already posted in this question a few days ago. In the picture, a crane moves from the start position to the waypoint $a$ as a first move. Now... the agent could theoretically go directly to waypoint $d$, ignoring all the waypoints in between. So I thought to force the agent to move vertically at first by invalidating the actions along the $x, y$ axes.
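For concreteness, this is roughly what I had in mind: a hand-written mask that disallows the $x$ and $y$ actions until the crane has reached the height of waypoint $a$. The state layout and the action indices are assumptions I made up for this sketch.

```python
import numpy as np

# Assumed action indices: 0/1 move along x, 2/3 move along y, 4/5 move along z (up/down)
X_ACTIONS, Y_ACTIONS, Z_ACTIONS = (0, 1), (2, 3), (4, 5)

def crane_action_mask(position: np.ndarray, waypoint_a_z: float) -> np.ndarray:
    """Allow only vertical movement until the crane has reached the height of waypoint a."""
    mask = np.ones(6, dtype=bool)
    if position[2] < waypoint_a_z:                 # still below waypoint a
        mask[list(X_ACTIONS + Y_ACTIONS)] = False  # forbid x/y moves for now
    return mask
```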

Now... reading the answer from @Neil-Slater I can try to answer my question myself:

  1. The movement from start to $a$ does not require any invalidation of actions, because moving along the other axes is still permitted, just not wanted. So the actions should not be masked at all; the reward should be enough.
  2. Or, alternatively: the movement from start to $a$ needs to be forced by masking of some sort, otherwise the agent would not learn a suitable policy for the vertical movement.

I hope I have now expressed myself better and more clearly.

Dave
1 Answer


Invalid actions are invalid because they cannot be taken at all by the agent. In board game scenarios that is usually because they are not allowed by the rules of the game, so a rules engine for that game can be used to filter actions down to the allowed set.
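To make the mechanics concrete: a minimal sketch, assuming PyTorch and a boolean mask of allowed actions coming from such a rules engine. "Masking" here means pushing the logits of disallowed actions to $-\infty$ before sampling, so their probability becomes exactly zero.

```python
import torch

def masked_categorical(logits: torch.Tensor, mask: torch.Tensor) -> torch.distributions.Categorical:
    """Build an action distribution where invalid actions have zero probability.

    `mask` is a boolean tensor, True where the action is allowed.
    """
    masked_logits = logits.masked_fill(~mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

# Example: 3 actions, action 2 disallowed by the rules engine
logits = torch.tensor([0.1, 0.5, 2.0])
mask = torch.tensor([True, True, False])
dist = masked_categorical(logits, mask)
action = dist.sample()             # never returns 2
log_prob = dist.log_prob(action)   # used in the PPO loss as usual
```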

An action that can be taken by an agent, but that would result in an unwanted reward is not considered invalid, and would therefore not be filtered by any approach that is called "invalid action masking".

Generally you would not want to remove an action from consideration entirely because something predicted a negative reward. That is already the role of the agent, to learn an optimal policy, and filtering out actions based on incomplete information will reduce the agent's ability to explore the space and find an optimal path.

The example extension of CartPole does not define any invalid actions. With no other information to go on, all three actions are always valid: the agent can take them, and the environment will move into a new state. Contrast this with chess, where an action might be to pick up an opponent's piece and move it as if it were some different piece. This is invalid because it breaks the rules of the game, and completing such a move would stop the environment from being a normal game of chess. Such moves are usually filtered - although another option is to allow them to be requested, give a negative reward while leaving the game in the same state, and let the agent re-request its move (or, more drastically, have the agent immediately forfeit the game). Most game-playing agents filter invalid moves and don't process them in any way, since learning the rules of the game is not the main objective.
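A sketch of that "penalise and re-request" alternative, where `is_legal` and `apply_move` stand in for a hypothetical rules engine and game logic (both made up for illustration):

```python
def step_with_penalty(env_state, action, is_legal, apply_move):
    """Handle one move in the 'penalise and re-request' style.

    An illegal move leaves the state unchanged, returns a penalty,
    and lets the agent request another action.
    """
    if not is_legal(env_state, action):
        return env_state, -1.0, False, {"invalid_action": True}
    next_state, reward, done = apply_move(env_state, action)
    return next_state, reward, done, {}
```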

If you know more than the initial agent, it is possible to apply filtering to action choices any way you want to. The effect will be that the agent never experiences taking those actions, and never learns about them. So the same filtering must be left in place when you finish the learning, and should probably be considered a non-learning part of the agent.

You can filter an action in this way using any logic you like. You might decide to do so, for instance, if, outside of the reinforcement learning agent, you were able to predict something disastrous, such as a robot falling off a table and causing damage.
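A minimal sketch of such a hand-written filter sitting outside the learned policy, assuming a hypothetical `is_safe(state, action)` predicate; note that it has to stay in place at deployment time too:

```python
import numpy as np

def choose_safe_action(policy_probs: np.ndarray, state, is_safe) -> int:
    """Pick the most probable action that the external safety check allows.

    `is_safe(state, action)` is a hand-written, non-learning predicate;
    the agent never experiences the actions it rules out.
    """
    allowed = [a for a in range(len(policy_probs)) if is_safe(state, a)]
    if not allowed:
        raise RuntimeError("No safe action available in this state")
    return max(allowed, key=lambda a: policy_probs[a])
```

In practice you would more likely renormalise the remaining probabilities and sample from them rather than act greedily, but the idea is the same.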

You can also "truncate" learning by having an exploration approach that cuts off exploration of the worst action-value predictions. This is not a usual approach, and it has the same caveat as any other arbitrary action filtering: it may reduce the effectiveness of the agent, because a trajectory that seemed bad initially might turn out to have a better end result once different actions are learned for later states. If you stop an agent from exploring that trajectory, it may never learn otherwise (although some off-policy and background planning approaches might have already collected enough data to self-correct).

Truncating learning in that way would be unusual. But "soft" versions of the approach, where the worst predictions get the least exploration, are more common. Thompson sampling for reinforcement learning is one such approach.
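A toy sketch of that idea, under the (assumed) simplification of an independent Gaussian posterior over each action value: actions with poor estimates are explored rarely, but never excluded entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_action(q_mean: np.ndarray, q_std: np.ndarray) -> int:
    """Thompson-style action selection: sample one plausible value per action
    from its posterior and act greedily on the sampled values."""
    sampled_values = rng.normal(q_mean, q_std)
    return int(np.argmax(sampled_values))

# Example: action 2 looks worst on average, so it is explored least - but not never.
print(thompson_action(np.array([1.0, 0.8, -2.0]), np.array([0.5, 0.5, 0.5])))
```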

Neil Slater