
I would like to use DQN to solve a constrained MDP. The problem has constraints on the action space: the set of available actions changes from one time step to the next until the end of the episode. The possible action sets are listed below.

  • 0, 1, 2, 3, 4
  • 0, 2, 3, 4
  • 0, 3, 4
  • 0, 4

Does this mean I need to learn 4 different Q networks, one for each possibility? Also, correct me if I am wrong, but it looks like if I specify an action size of 3, the implementation automatically assumes the actions are 0, 1, 2, whereas in my case they should be 0, 3, 4. How should I implement this?

ycenycute

1 Answer


There are two relevant neural network designs for DQN:

  • Model the Q function directly, $Q(s,a): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: the neural network takes a concatenated state and action as input and outputs a single real value. This is arguably the more natural fit to Q-learning, but it can be inefficient.

  • Model all Q values for a given state, $Q(s,\cdot): \mathcal{S} \rightarrow \mathbb{R}^{|\mathcal{A}|}$: the neural network takes the current state as input and outputs the values of all actions in that state as a vector. Both designs are sketched below.
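
Here is a minimal PyTorch sketch of both designs. The layer sizes, `STATE_DIM` value, and class names are illustrative assumptions, not something from your problem; only `NUM_ACTIONS = 5` comes from the action set 0-4 in the question.

```python
import torch
import torch.nn as nn

STATE_DIM = 8      # illustrative assumption: size of the state vector
NUM_ACTIONS = 5    # actions 0..4 from the question

# First design: Q(s, a) -> scalar; the state and a one-hot action are concatenated.
class QSA(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + NUM_ACTIONS, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action_onehot):
        # state: (batch, STATE_DIM), action_onehot: (batch, NUM_ACTIONS)
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

# Second design: Q(s, .) -> vector of values for every action in the full space.
class QS(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, state):
        # state: (batch, STATE_DIM) -> (batch, NUM_ACTIONS)
        return self.net(state)
```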

For the first architecture, you can decide which actions to evaluate by how you construct the minibatch. You pre-filter to the allowed actions for each state.
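Continuing the sketch above (so `QSA`, `NUM_ACTIONS`, and the helper name are assumptions), pre-filtering just means you only ever build (state, action) inputs for the actions that are legal in that state, for example when picking the greedy action or computing the target max:

```python
import torch

def greedy_allowed_action(q_sa, state, allowed_actions):
    """Evaluate Q(s, a) only for the allowed actions (e.g. [0, 3, 4]) and pick the best."""
    n = len(allowed_actions)
    states = state.unsqueeze(0).expand(n, -1)                      # repeat the state n times
    onehots = torch.zeros(n, NUM_ACTIONS)
    onehots[torch.arange(n), torch.tensor(allowed_actions)] = 1.0  # one-hot encode each allowed action
    with torch.no_grad():
        q_values = q_sa(states, onehots)                           # one value per allowed action
    return allowed_actions[q_values.argmax().item()]
```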

For the second architecture, you must post-filter the action values to those allowed by the state.
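One common way to post-filter (again a sketch, building on the `QS` network assumed above) is to add a large negative mask to the disallowed entries before taking the argmax for action selection, and before the max used in the TD target:

```python
import torch

def masked_greedy_action(q_s, state, allowed_actions):
    """Compute all Q(s, .) values, then ignore actions not allowed in this state."""
    with torch.no_grad():
        q_values = q_s(state.unsqueeze(0)).squeeze(0)  # shape: (NUM_ACTIONS,)
    mask = torch.full_like(q_values, float('-inf'))
    mask[torch.tensor(allowed_actions)] = 0.0          # keep only the legal actions
    return int((q_values + mask).argmax().item())
```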

There are other possibilities for constructing variable-length inputs and outputs to neural networks, e.g. using RNNs. However, these are normally not worth the extra effort. A pre- or post-filter on the actions, applied around a network that can process the whole action space (including impossible actions), is usually all you need. Don't worry that the neural network may calculate some unneeded or nonsense values.

Neil Slater