Most RL books (Sutton & Barto, Bertsekas, etc.) discuss policy iteration for infinite-horizon MDPs. Does policy iteration convergence also hold for finite-horizon MDPs? If yes, how can we derive the algorithm?
1 Answer
In the discussion about Neil Slater's answer (which he, sadly, deleted) it was pointed out that the policy $\pi$ should also depend on the horizon $h$. The decision to take action $a$ can be influenced by how many steps are left. So the "policy" in that case is actually a collection of policies $\pi_h(a|s)$, indexed by $h$, the distance to the horizon.
Alternatively, one can look at it as if our state space $\mathcal{S}$ is now extended with that integer. So we can "lift" our original finite-horizon MDP $(\mathcal{S},\mathcal{A},P,R)$ into a new infinite-horizon MDP by substituting:
$$\mathcal{S} \to \left(\mathcal{S}\times\{0,1,\dots,h\}\right) \cup \{\epsilon\},\qquad \mathcal{A} \to \mathcal{A},\qquad R \to \tilde R,\qquad P \to \tilde P$$ The new reward and transition functions make sure that the horizon keeps decreasing and that we eventually end up in an absorbing state $\epsilon$ that has no future effects (writing $s_n$ for the lifted state $(s,n)$): $$\tilde P(s'_{n-1}|s_n,a) = P(s'|s,a),\qquad \tilde P(\epsilon|s_0,a) = \tilde P(\epsilon|\epsilon,a) = 1,$$ $$\tilde R(s_n,a,s'_{n-1}) = R(s,a,s'),\qquad \tilde R(s_0,a,\epsilon) = \tilde R(\epsilon,a,\epsilon) = 0.$$ This way I've reduced the finite-horizon MDP to an infinite-horizon one, so I can reuse the policy iteration convergence result for infinite-horizon MDPs.
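Here is a minimal tabular sketch of this lifting, assuming the original MDP is given as dictionaries `P[s][a] = [(prob, s_next), ...]` and `R[s][a][s_next] = reward` (these conventions, and the names `lift_mdp`, `policy_iteration`, `EPS`, are my own for illustration), followed by ordinary policy iteration run on the lifted MDP:

```python
import itertools

EPS = "eps"  # the absorbing state epsilon

def lift_mdp(S, A, P, R, h):
    """Build the lifted (infinite-horizon) MDP on states (s, n) plus eps."""
    S_tilde = [(s, n) for s in S for n in range(h + 1)] + [EPS]
    P_tilde, R_tilde = {}, {}
    for (s, n) in itertools.product(S, range(h + 1)):
        for a in A:
            if n == 0:
                # Horizon exhausted: fall into eps deterministically, no reward.
                P_tilde[((s, n), a)] = [(1.0, EPS)]
                R_tilde[((s, n), a, EPS)] = 0.0
            else:
                # Copy the original dynamics while decreasing the horizon index.
                P_tilde[((s, n), a)] = [(p, (s2, n - 1)) for (p, s2) in P[s][a]]
                for (p, s2) in P[s][a]:
                    R_tilde[((s, n), a, (s2, n - 1))] = R[s][a][s2]
    for a in A:
        # eps only leads back to eps and yields no reward.
        P_tilde[(EPS, a)] = [(1.0, EPS)]
        R_tilde[(EPS, a, EPS)] = 0.0
    return S_tilde, A, P_tilde, R_tilde

def policy_iteration(S, A, P, R, gamma=0.99, tol=1e-10):
    """Standard policy iteration on a tabular MDP with keys P[(s, a)] and R[(s, a, s2)]."""
    pi = {s: A[0] for s in S}
    while True:
        # Policy evaluation (iterative)
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v = sum(p * (R[(s, pi[s], s2)] + gamma * V[s2]) for p, s2 in P[(s, pi[s])])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement
        stable = True
        for s in S:
            best = max(A, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in P[(s, a)]))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V
```

The greedy action recovered at the lifted state $(s,n)$ is exactly the horizon-indexed policy $\pi_n(a|s)$ mentioned above.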
Couple of notes:
- At first this feels like a huge increase in the state space, making the whole problem unnecessarily complex. But this complexity is intrinsic to the problem: both the policy and the value function depend on the distance to the horizon. So it is necessary to treat the extended set of unknowns in a single, self-consistent manner.
- The infinite-horizon policy iteration convergence relies on a discounting factor $\gamma < 1$. The finite-horizon case doesn't need $\gamma$ for convergence (see the backward-induction sketch after this list). That's where I feel like I've cheated a bit.
- I came up with this approach myself. It feels quite obvious though. I'd expect this approach to either be mistaken or be already mentioned somewhere in the literature - comments pointing out one or the other are welcome.
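For contrast with the discounting note above, here is a minimal sketch (using the same hypothetical dictionary conventions as before) of solving the finite-horizon MDP directly by backward induction: the recursion bottoms out at $V_0 = 0$, takes exactly $h$ sweeps, and needs no discount factor.

```python
def backward_induction(S, A, P, R, h):
    """Finite-horizon optimal policy via backward induction, no discounting."""
    V = {s: 0.0 for s in S}      # V_0: no steps left, so no future reward
    pi = {}                      # pi[(n, s)] = greedy action with n steps left
    for n in range(1, h + 1):
        V_next = {}
        for s in S:
            # Exact one-step lookahead against V_{n-1}
            q = {a: sum(p * (R[s][a][s2] + V[s2]) for p, s2 in P[s][a]) for a in A}
            pi[(n, s)] = max(q, key=q.get)
            V_next[s] = q[pi[(n, s)]]
        V = V_next
    return pi, V                 # V is V_h, the value with the full horizon left
```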