
I would like to use DQN to solve a constrained MDP. The problem has constraints on the action space: the set of available actions changes from one time step to the next until the end of the episode. The possible action sets are listed below.

  • 0, 1, 2, 3, 4
  • 0, 2, 3, 4
  • 0, 3, 4
  • 0, 4

Does this mean I need to learn 4 different Q networks, one for each possibility? Also, correct me if I am wrong, but it looks like if I specify an action size of 3, the implementation automatically assumes the actions are 0, 1, 2, whereas in my case they should be 0, 3, 4. How should I implement this?

ycenycute

1 Answer


There are two relevant neural network designs for DQN:

  • Model the Q function directly, $Q(s,a): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: the neural network takes a concatenated state and action as input and outputs a single real value. This is arguably the more natural fit to Q-learning, but it can be inefficient.

  • Model all Q values for a given state, $Q(s,\cdot): \mathcal{S} \rightarrow \mathbb{R}^{|\mathcal{A}|}$: the neural network takes the current state as input and outputs the values of all actions in that state as a vector. Both designs are sketched below.
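
Here is a minimal PyTorch sketch of both designs. The layer sizes, `STATE_DIM` value, and class names are illustrative assumptions, not something from your problem; only `NUM_ACTIONS = 5` comes from the action set 0-4 in the question.

```python
import torch
import torch.nn as nn

STATE_DIM = 8      # illustrative assumption: size of the state vector
NUM_ACTIONS = 5    # actions 0..4 from the question

# First design: Q(s, a) -> scalar; the state and a one-hot action are concatenated.
class QSA(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + NUM_ACTIONS, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action_onehot):
        # state: (batch, STATE_DIM), action_onehot: (batch, NUM_ACTIONS)
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

# Second design: Q(s, .) -> vector of values for every action in the full space.
class QS(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, state):
        # state: (batch, STATE_DIM) -> (batch, NUM_ACTIONS)
        return self.net(state)
```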

For the first architecture, you can decide which actions to evaluate by how you construct the minibatch. You pre-filter to the allowed actions for each state.
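Continuing the sketch above (so `QSA`, `NUM_ACTIONS`, and the helper name are assumptions), pre-filtering just means you only ever build (state, action) inputs for the actions that are legal in that state, for example when picking the greedy action or computing the target max:

```python
import torch

def greedy_allowed_action(q_sa, state, allowed_actions):
    """Evaluate Q(s, a) only for the allowed actions (e.g. [0, 3, 4]) and pick the best."""
    n = len(allowed_actions)
    states = state.unsqueeze(0).expand(n, -1)                      # repeat the state n times
    onehots = torch.zeros(n, NUM_ACTIONS)
    onehots[torch.arange(n), torch.tensor(allowed_actions)] = 1.0  # one-hot encode each allowed action
    with torch.no_grad():
        q_values = q_sa(states, onehots)                           # one value per allowed action
    return allowed_actions[q_values.argmax().item()]
```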

For the second architecture, you must post-filter the action values to those allowed by the state.
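One common way to post-filter (again a sketch, building on the `QS` network assumed above) is to add a large negative mask to the disallowed entries before taking the argmax for action selection, and before the max used in the TD target:

```python
import torch

def masked_greedy_action(q_s, state, allowed_actions):
    """Compute all Q(s, .) values, then ignore actions not allowed in this state."""
    with torch.no_grad():
        q_values = q_s(state.unsqueeze(0)).squeeze(0)  # shape: (NUM_ACTIONS,)
    mask = torch.full_like(q_values, float('-inf'))
    mask[torch.tensor(allowed_actions)] = 0.0          # keep only the legal actions
    return int((q_values + mask).argmax().item())
```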

There are other possibilities for constructing variable-length inputs and outputs to neural networks, e.g. using RNNs. However, these are normally not worth the extra effort. A pre- or post-filter on the actions, applied around a network that can process the whole action space (including impossible actions), is usually all you need. Don't worry that the neural network may calculate some unneeded or nonsense values.

Neil Slater