
Most RL books (Sutton & Barto, Bertsekas, etc.) discuss policy iteration for infinite-horizon MDPs. Does the policy iteration convergence result also hold for finite-horizon MDPs? If yes, how can we derive the algorithm?

user529295

1 Answer


In the discussion about Neil Slater's answer (which he, sadly, deleted) it was pointed out that the policy $\pi$ should also depend on the horizon $h$: the choice of action $a$ can be influenced by how many steps are left. So the "policy" in that case is actually a collection of policies $\pi_h(a|s)$ indexed by $h$, the distance to the horizon.
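To make that concrete, here is a minimal sketch (assuming a tabular MDP given as NumPy arrays `P[s, a, s']` and `R[s, a, s']`; the function name and array layout are mine, not from any reference) that computes such a horizon-indexed optimal policy by backward induction:

```python
import numpy as np

def backward_induction(P, R, h):
    """Compute horizon-indexed greedy policies pi[n] for a tabular MDP.

    P[s, a, s'] are transition probabilities, R[s, a, s'] are rewards,
    h is the horizon. pi[n] is the policy to use with n steps left.
    """
    S, A, _ = P.shape
    v = np.zeros(S)                    # value with 0 steps left
    pi = np.zeros((h + 1, S), int)     # pi[0] is arbitrary: no steps remain
    for n in range(1, h + 1):
        q = np.sum(P * (R + v[None, None, :]), axis=2)  # Q with n steps left
        pi[n] = q.argmax(axis=1)
        v = q.max(axis=1)
    return pi
```

In general the rows `pi[n]` differ from each other, which is exactly the sense in which the policy depends on the distance to the horizon.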

Alternatively, one can look at it as if our state space $\mathcal{S}$ is now extended with that integer. So we can "lift" our original finite-horizon MDP $(\mathcal{S},\mathcal{A},P,R)$ into a new infinite-horizon MDP by substituting:

$$\mathcal{S} \to \left(\mathcal{S}\times\{0,1,\dots,h\}\right) \cup \{\epsilon\},\qquad \mathcal{A} \to \mathcal{A},\qquad R \to \tilde R,\qquad P \to \tilde P$$

The new transition and reward functions make sure that the horizon decreases at every step and that we end up in an absorbing state $\epsilon$ that has no further effect. For $1 \le n \le h$:

$$\tilde P(s'_{n-1}|s_n,a) = P(s'|s,a),\qquad \tilde P(\epsilon|s_0,a) = \tilde P(\epsilon|\epsilon,a) = 1$$
$$\tilde R(s_n,a,s'_{n-1}) = R(s,a,s'),\qquad \tilde R(s_0,a,\epsilon) = \tilde R(\epsilon,a,\epsilon) = 0$$

This way I've reduced the finite-horizon MDP to an infinite-horizon one, so I can reuse the policy iteration convergence result for infinite-horizon MDPs.
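For concreteness, here is a minimal sketch of this lifting in Python/NumPy, assuming the original MDP is tabular and given as arrays `P[s, a, s']` and `R[s, a, s']`. The names `lift_mdp` and `policy_iteration` are mine; the policy iteration routine is just the standard textbook algorithm, nothing specific to this construction:

```python
import numpy as np

def lift_mdp(P, R, h):
    """Embed a finite-horizon MDP into an infinite-horizon one.

    Lifted states are pairs (s, n) for n = 0..h plus one absorbing state eps.
    From (s, n) with n >= 1 we move to (s', n-1) with the original dynamics
    and rewards; (s, 0) and eps transition to eps with zero reward.
    """
    S, A, _ = P.shape
    N = S * (h + 1) + 1                # lifted state count; last index is eps
    eps = N - 1
    idx = lambda s, n: n * S + s       # flat index of lifted state (s, n)

    P_t = np.zeros((N, A, N))
    R_t = np.zeros((N, A, N))
    for n in range(1, h + 1):
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    P_t[idx(s, n), a, idx(s2, n - 1)] = P[s, a, s2]
                    R_t[idx(s, n), a, idx(s2, n - 1)] = R[s, a, s2]
    for s in range(S):
        P_t[idx(s, 0), :, eps] = 1.0   # horizon exhausted: go to eps, reward 0
    P_t[eps, :, eps] = 1.0             # eps is absorbing with reward 0
    return P_t, R_t

def policy_iteration(P, R, gamma=0.99):
    """Standard policy iteration on a (possibly lifted) infinite-horizon MDP."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly
        P_pi = P[np.arange(S), pi]                        # (S, S)
        r_pi = np.sum(P_pi * R[np.arange(S), pi], axis=1) # (S,)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # greedy policy improvement
        q = np.sum(P * (R + gamma * v[None, None, :]), axis=2)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, v
        pi = new_pi
```

Running `policy_iteration(*lift_mdp(P, R, h))` returns a single stationary policy over the lifted states, which decodes back to the collection $\pi_n(a|s)$ described above.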

A couple of notes:

  • At first this feels like a huge increase in the state space, making the whole problem unnecessarily complex. But this complexity is intrinsic to the problem: both the policy and the value function depend on the distance to the horizon, so the extended set of unknowns has to be treated in a single self-consistent manner.
  • The infinite-horizon policy iteration convergence result relies on a discount factor $\gamma < 1$, while the finite-horizon problem doesn't need $\gamma$ for convergence at all. That's where I feel like I've cheated a bit.
  • I came up with this approach myself, but it feels quite obvious, so I'd expect it either to be mistaken or to be already mentioned somewhere in the literature; comments pointing out one or the other are welcome.
Kostya