Both value iteration and policy iteration are Generalized Policy Iteration (GPI) algorithms, but they differ in the mechanics of their updates. Policy iteration first computes the full value function of the current policy, then derives the Q-function from it and improves the policy greedily with respect to that Q. Value iteration, by contrast, uses a truncated V-function update to obtain the Q updates, and only extracts a policy once V has converged.

What are the inherent advantages of using one over the other in a practical setting?

2 Answers

Value Iteration vs. Policy Iteration

Below is a list of the differences and similarities between value iteration and policy iteration.

Differences

  • Architecture: value iteration network vs. policy iteration architecture (see the Architecture reference below)
  • Starting point: value iteration starts from a random value function; policy iteration starts from a random policy
  • Algorithm: value iteration is simpler; policy iteration is more involved
  • Overall computation cost: value iteration is more expensive; policy iteration is cheaper
  • Execution time: value iteration is slower; policy iteration is faster
  • Iterations to converge: value iteration needs significantly more; policy iteration converges in fewer iterations
  • Guaranteed to converge: yes for both

Similarities

  • Both are dynamic programming algorithms
  • Both employ variations of Bellman updates
  • Both exploit one-step look-ahead
  • Both are guaranteed to converge to an optimal policy in the end

Policy iteration is reported to converge faster than value iteration.

USAGE PREFERENCE

As mentioned in the differences above, the main advantage of policy iteration over value iteration is that it converges in fewer iterations, which reduces its overall computation cost and execution time.
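
To make the "fewer iterations" point concrete, below is a small, self-contained numpy experiment; the random MDP, the array layout (P[a, s, s'], R[s, a]), the tolerance, and the stopping rules are all assumptions made for illustration, not taken from the references below. It runs both algorithms on the same toy problem and counts their outer iterations.

```python
import numpy as np

# Toy random MDP; every name, shape, and constant here is an assumption for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.95
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s, :] is a distribution over next states
R = rng.normal(size=(S, A))                  # R[s, a] is the expected immediate reward

def q_from_v(v):
    """Action values Q[s, a] obtained by a one-step look-ahead from v."""
    return R + gamma * np.einsum("ast,t->sa", P, v)

# Value iteration: many cheap sweeps, each a single truncated backup of V.
V, vi_iters = np.zeros(S), 0
while True:
    V_new = q_from_v(V).max(axis=1)
    vi_iters += 1
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Policy iteration: few outer steps, each with an exact policy evaluation (a linear solve).
pi, pi_iters = np.zeros(S, dtype=int), 0
while True:
    P_pi = P[pi, np.arange(S), :]            # transition matrix induced by pi
    r_pi = R[np.arange(S), pi]               # reward vector induced by pi
    v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    pi_new = q_from_v(v_pi).argmax(axis=1)
    pi_iters += 1
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print(f"value iteration sweeps: {vi_iters}, policy iteration steps: {pi_iters}")
```

Each policy-iteration step is dominated by the linear solve, which is why it is more expensive per iteration even though it typically needs far fewer iterations overall.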

REFERENCES

  1. Research papers
  2. Book
    • Artificial Intelligence: A Modern Approach, by Stuart J. Russell and Peter Norvig, Chapter 17: Making Complex Decisions
  3. Architecture

Value iteration (VI) is a truncated version of Policy iteration (PI).

PI has two steps:

  • The first step is policy evaluation: compute the state values of a given policy. This step essentially solves the Bellman equation $$v_\pi=r_\pi+\gamma P_\pi v_\pi,$$ written here in matrix-vector form (I assume the basics are already known).
  • The second step is policy improvement: at each state, select the action with the greatest action value (i.e., the greedy policy with respect to the evaluated values): $$\pi_{\text{new}}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_{\pi_{\text{old}}}\right),$$ where $v_{\pi_{\text{old}}}$ is the value function obtained in the evaluation step. (A short numpy sketch of both steps follows this list.)
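
For concreteness, here is a minimal numpy sketch of these two steps in the matrix-vector notation above; the array layout (P[a, s, s'] for transition probabilities, R[s, a] for expected rewards, pi[s] for a deterministic policy) is an assumption made for this sketch, not something fixed by the answer.

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma):
    """Solve v_pi = r_pi + gamma * P_pi v_pi exactly with a linear solve.

    P[a, s, s'] are transition probabilities, R[s, a] expected rewards,
    and pi[s] the action the (deterministic) policy takes in state s.
    """
    S = R.shape[0]
    P_pi = P[pi, np.arange(S), :]        # P_pi[s, s'] = P[pi[s], s, s']
    r_pi = R[np.arange(S), pi]           # r_pi[s] = R[s, pi[s]]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_improvement(v, P, R, gamma):
    """Greedy step: at each state pick the action with the largest action value."""
    Q = R + gamma * np.einsum("ast,t->sa", P, v)   # Q[s, a] by one-step look-ahead
    return Q.argmax(axis=1)
```

One outer iteration of PI is then policy_improvement(policy_evaluation(pi, P, R, gamma), P, R, gamma), repeated until the policy stops changing.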

The key point is: the policy evaluation step requires, in principle, an infinite number of iterations to solve the Bellman equation (i.e., to get the exact state values). In particular, we use the following iterative algorithm to solve the Bellman equation: $$v_\pi^{(k+1)}=r_\pi+\gamma P_\pi v_\pi^{(k)}, \quad k=1,2,\dots$$ We can prove that $v_\pi^{(k)}\rightarrow v_\pi$ as $k\rightarrow\infty$. There are three cases for executing this iterative algorithm:

  • Case 1: run an infinite number of iterations so that $v_\pi^{(\infty)}=v_\pi$. This is impossible in practice; of course, in practice we may run sufficiently many iterations until some metric (such as the difference between two consecutive values) is small enough.
  • Case 2: run just a single step, so that $v_\pi^{(2)}$ is used for the policy improvement step.
  • Case 3: run a few iterations (say, $N$ of them), so that $v_\pi^{(N+1)}$ is used for the policy improvement step.

Case 1 is the policy iteration algorithm; case 2 is the value iteration algorithm; case 3 is a more general truncated version. Such a truncated version does not require an infinite number of iterations and can converge faster than case 2, so it is often used in practice.
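
A minimal sketch of this general truncated scheme, under the same assumed P[a, s, s'] / R[s, a] layout as the sketch above: eval_sweeps=1 reduces to case 2 (value iteration), a very large eval_sweeps approximates case 1 (policy iteration), and moderate values give case 3.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma, eval_sweeps, tol=1e-8):
    """Case 3 as a sketch: a few evaluation sweeps per greedy improvement.

    P[a, s, s'] and R[s, a] describe an assumed tabular MDP. eval_sweeps=1
    recovers value iteration (case 2); a very large value approximates
    policy iteration (case 1).
    """
    S = R.shape[0]
    v = np.zeros(S)
    while True:
        # Policy improvement: act greedily with respect to the current estimate v.
        Q = R + gamma * np.einsum("ast,t->sa", P, v)
        pi = Q.argmax(axis=1)
        # Truncated evaluation: only eval_sweeps Bellman backups, warm-started at v.
        P_pi, r_pi = P[pi, np.arange(S), :], R[np.arange(S), pi]
        v_new = v
        for _ in range(eval_sweeps):
            v_new = r_pi + gamma * P_pi @ v_new
        if np.max(np.abs(v_new - v)) < tol:
            return pi, v_new
        v = v_new
```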