
I am reading Reinforcement Learning: An Introduction by Sutton & Barto. As far as I understood, the authors claim that the policy and value iteration methods converge to an optimal stationary point. I understand the procedure of these two iterative algorithms, but I cannot accept that they converge to an optimal point.

In the textbook and in many posts I found by googling, people say something like: "The value function monotonically increases as the iteration progresses. Thus, it converges to the optimal policy, as well as the optimal value function."

I strongly agree that "if the algorithm's performance improves monotonically and there exists an upper bound on its performance, then the algorithm will converge to a stationary point." However, I cannot accept the word "optimal". I think that, to claim an algorithm converges to an optimal stationary point, we need to show not only its monotonic improvement property but also its "locally non-stopping property", i.e. that it cannot get stuck at a suboptimal point. (Sorry, I made this term up myself, but I believe you experts can understand what I mean.)

I believe there must be some point that I have not understood. Can someone explain why the policy and value iteration methods converge to an "OPTIMAL" solution?

P.S. Are the policy and value iteration methods optimal algorithms only if the system can be represented as a Markov decision process?

Danny_Kim

2 Answers


These two algorithms converge to the optimal value function because

  1. they are instances of generalized policy iteration (GPI), so they iteratively perform a policy evaluation (PE) step followed by a policy improvement (PI) step

  2. the PE step is an iterative/numerical implementation of the Bellman expectation operator (BEO) (i.e. it is a numerical algorithm equivalent to solving a system of equations); here you have an explanation of what the Bellman operator is

  3. the BEO is a contraction (proof here), so the iterative application of the BEO brings the approximate value function closer to its unique fixed point, i.e. PE converges to the value function of the current policy (proof here)

  4. policy improvement is guaranteed to generate a policy that is better than the one in the previous iteration, unless the policy in the previous iteration was already optimal (see the policy improvement theorem in section 4.2 of the RL bible); a minimal code sketch of these two steps follows this list
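
To make these two steps concrete, here is a minimal policy-iteration sketch in Python for a small tabular MDP with known dynamics. The data layout is my own assumption for illustration (it is not notation from the book): `P[s][a]` is a list of `(probability, next_state, reward)` triples and `gamma` is the discount factor.

```python
import numpy as np

def policy_evaluation(policy, P, gamma=0.9, tol=1e-8):
    """PE: repeatedly apply the Bellman expectation operator until the
    value function of the current (deterministic) policy stops changing."""
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            a = policy[s]
            # Expected update: a weighted sum over ALL possible next states.
            new_v = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def policy_improvement(v, P, gamma=0.9):
    """PI: act greedily with respect to the current value function."""
    return [int(np.argmax([sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                           for a in range(len(P[s]))]))
            for s in range(len(P))]

def policy_iteration(P, gamma=0.9):
    policy = [0] * len(P)  # arbitrary initial deterministic policy
    while True:
        v = policy_evaluation(policy, P, gamma)
        new_policy = policy_improvement(v, P, gamma)
        if new_policy == policy:  # greedy policy no longer changes => optimal
            return policy, v
        policy = new_policy
```

The outer loop can only terminate when the greedy policy no longer changes, and, by the policy improvement theorem, that can only happen when the policy is already optimal.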

One thing that may confuse you is that you may not have the exact definition of the value function in mind. The value function $v_\pi(s)$ is defined as the expected return that you get by starting in state $s$ and then following policy $\pi$. So, if you have some policy $\pi$ and you run PE until convergence, then you know that the resulting value function is exactly $v_\pi$, the value function of $\pi$. Now, if $\pi_{t+1}$ is guaranteed to be a strict improvement over $\pi_{t}$ (unless $\pi_t$ is already optimal), then it basically means that you will get more reward with $\pi_{t+1}$ (which is the goal).
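
To spell this out in the book's notation (just a restatement, nothing new), the definition of the state-value function and the statement of the policy improvement theorem are

$$v_\pi(s) \doteq \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],$$

and, if $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all states $s$, then

$$v_{\pi'}(s) \geq v_\pi(s) \quad \text{for all } s.$$

Because the improvement is strict at some state unless $\pi$ is already optimal, and a finite MDP has only finitely many deterministic policies, the sequence of policies cannot stall at a suboptimal one; this is exactly the "locally non-stopping property" you are asking for.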

If you read the linked proofs and chapter 4 of the bible, then you should understand why these algorithms converge.

To address your last point, yes, we assume that we have an MDP. That's an assumption that most famous DP and RL algorithms make.

nbro

Based on the fantastic answer by @nbro, let me add a small clue which may help in understanding why policy iteration converges to an optimal policy.

So, Danny, you said:

However, I cannot accept the word "optimal". I think that, to claim an algorithm converges to an optimal stationary point, we need to show not only its monotonic improvement property but also its "locally non-stopping property".

I think what you actually worry about is that the iteration may miss some possible cases, including the optimal one. However, it won't. Here is an excerpt from page 75 of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto:

All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.

So the updates are based on all possible next states rather than on a sample next state, which means no possible case is missed during the iteration.
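
To illustrate the difference in code, here is a minimal sketch contrasting an expected update with a sample update, assuming a tabular MDP where `P[s][a]` is a list of `(probability, next_state, reward)` triples (my own layout for illustration, not notation from the book):

```python
import random

def expected_update(v, P, s, a, gamma=0.9):
    """DP-style expected update: averages over ALL possible next states,
    weighted by their transition probabilities, so no outcome is missed."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def sample_update(v, P, s, a, gamma=0.9):
    """Sample update (as in TD-style methods): uses ONE sampled next state,
    so a single update only sees one of the possible outcomes."""
    probs = [p for p, _, _ in P[s][a]]
    i = random.choices(range(len(probs)), weights=probs)[0]
    _, s2, r = P[s][a][i]
    return r + gamma * v[s2]
```

Policy evaluation and value iteration use the expected form, which is why no possible next state is ever left out of the iteration.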

yunfeng