
I created an OpenAI Gym environment, and I would like to evaluate the performance of the DQN agent from OpenAI Baselines on it. In my environment, the best possible outcome for the agent is 0 - the robot needs no unnecessary resources to complete the task. The goal is to minimize resource consumption: for each resource needed, there is a penalty of -1. In many states, only certain actions make physical sense. How do I deal with this?

There was already a question about the handling of invalid moves on AI StackExchange, which recommends ignoring the invalid moves. However, ignoring them would imply returning the same state and a reward of 0, which is the best possible outcome - clearly not what should happen. Setting a drastic negative reward does not seem to work either, since even promising trajectories are compromised as soon as a single invalid action incurs that drastic penalty.
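
To make the problem more concrete, below is a hypothetical toy version of my environment (the valid-action rule and the state bounds are made up just for illustration), showing why "ignoring" an invalid action effectively hands out the best possible reward:

    import gym
    from gym import spaces


    class ToyResourceEnv(gym.Env):
        """Hypothetical toy environment illustrating the problem with
        'ignoring' invalid actions."""

        def __init__(self):
            self.action_space = spaces.Discrete(5)
            self.observation_space = spaces.Discrete(10)
            self.state = 0

        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            if action not in (0, 1):  # only actions 0 and 1 are valid here
                # "Ignoring" the invalid action: same state, reward 0.
                # But 0 is also the best possible reward, so doing nothing
                # looks as good as making progress.
                return self.state, 0.0, False, {}
            self.state += 1           # progress towards completing the task
            reward = -1.0             # one resource consumed
            done = self.state >= 9    # task completed
            return self.state, reward, done, {}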

What are other ways of handling invalid actions in scenarios where all rewards are either 0 (best) or negative?

My ideas/questions regarding the OpenAI Baselines DQN implementation:

  1. Is there any way to set the initial Q-values for the actions? I could set -infinity for the invalid actions.

  2. Is there any way to limit the set of valid actions per state? When env.step(action) returns the new state, can I somehow specify which actions are valid in it?


1 Answer


1) Is there any way to set the initial Q-values for the actions?

You can initialize the network in general, but you cannot set specific Q-values for specific actions in specific states - not through the network weights directly, at least. That would defeat the purpose of using backpropagation to optimize the weights and find the optimal parameters and Q-values.

2) Is there any way to limit the set of valid actions per state?

Again, not directly through the network. That would imply different network architectures for different states, which would mean optimizing several different networks, unless you used some convoluted approach like Net2Net to keep them synced.

However, you can bypass the network itself by filtering its outputs. Say that in state X only 2 out of 5 actions are valid. When DQN returns the Q-values for all actions, detect that you are in state X and choose the greedy action among the valid actions only, instead of among all actions. Better yet, have your environment send a list of the valid actions for whatever state you are in; this is more general, since it removes the need for enumerating discrete states and keeping a list of valid actions for each of them.
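
As a rough sketch (this is not part of the Baselines API; q_values is assumed to come from your Q-network's forward pass, and valid_actions from your environment, e.g. via the info dict):

    import numpy as np


    def greedy_valid_action(q_values, valid_actions):
        """Greedy action restricted to the valid actions only.

        q_values      : 1-D array with one Q-value per action.
        valid_actions : indices of the actions valid in the current state.
        """
        q_values = np.asarray(q_values, dtype=float)
        valid_actions = list(valid_actions)
        masked = np.full_like(q_values, -np.inf)
        masked[valid_actions] = q_values[valid_actions]
        return int(np.argmax(masked))


    # Usage sketch, assuming the environment reports its valid actions
    # through the info dict returned by env.step:
    #   obs, reward, done, info = env.step(action)
    #   action = greedy_valid_action(q_values, info["valid_actions"])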

You can also really just ignore invalid actions.

You claimed:

However, ignoring them would imply returning the same state and a 0 reward, the best possible outcome, which is clearly not the case.

If your environment has no time limit, then the best action really is to stand still and avoid wasting resources. If there is a time limit, the agent should eventually receive some penalty for not completing the task. If there is no time limit, you can instead reward the agent for completing the task: imagine you have 10 resources; instead of having the reward interval be [-10, 0], shift it to [0, 10] by awarding 10 points at the end of the episode. This gives your agent the motivation to complete the task instead of taking useless actions, and removes the need for action filtering altogether.
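
A minimal sketch of that shift, assuming a completion bonus equal to a hypothetical budget of 10 resources and an info flag ("task_completed") that your environment would have to provide:

    import gym


    class CompletionBonusWrapper(gym.Wrapper):
        """Hypothetical wrapper: keep the -1-per-resource reward during the
        episode, but add a bonus on successful completion so the return
        lies in [0, 10] instead of [-10, 0]."""

        COMPLETION_BONUS = 10  # assumed total resource budget

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            if done and info.get("task_completed", False):
                reward += self.COMPLETION_BONUS
            return obs, reward, done, info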
