Both value iteration and policy iteration are Generalized Policy Iteration (GPI) algorithms, but they differ in the mechanics of their updates. Policy iteration first computes the full value function of the current policy, then derives the Q-function from it and improves the policy greedily with respect to that Q. Value iteration, by contrast, uses a truncated V-function update to obtain the Q updates, and only extracts a policy once V has converged.

What are the inherent advantages of using one over the other in a practical setting?

2 Answers

Value Iteration vs. Policy Iteration

Below is a list of the differences and similarities between value iteration and policy iteration.

Differences

  • Architecture: value iteration network vs. policy iteration architecture (see the Architecture reference below)
  • Starting point: value iteration starts from a random value function; policy iteration starts from a random policy
  • Algorithm: value iteration is simpler; policy iteration is more involved
  • Overall computation cost: value iteration is more expensive; policy iteration is cheaper
  • Execution time: value iteration is slower; policy iteration is faster
  • Iterations to converge: value iteration needs significantly more; policy iteration converges in fewer iterations
  • Guaranteed to converge: yes for both

Similarities

  • Both are dynamic programming algorithms
  • Both employ variations of Bellman updates
  • Both exploit one-step look-ahead
  • Both are guaranteed to converge to an optimal policy in the end

Policy iteration is reported to converge faster than value iteration.

USAGE PREFERENCE

As mentioned in the differences above, the main advantage of policy iteration over value iteration is that it converges in fewer iterations, which reduces its overall computation cost and execution time.
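
To make the "fewer iterations" point concrete, below is a small, self-contained numpy experiment; the random MDP, the array layout (P[a, s, s'], R[s, a]), the tolerance, and the stopping rules are all assumptions made for illustration, not taken from the references below. It runs both algorithms on the same toy problem and counts their outer iterations.

```python
import numpy as np

# Toy random MDP; every name, shape, and constant here is an assumption for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.95
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s, :] is a distribution over next states
R = rng.normal(size=(S, A))                  # R[s, a] is the expected immediate reward

def q_from_v(v):
    """Action values Q[s, a] obtained by a one-step look-ahead from v."""
    return R + gamma * np.einsum("ast,t->sa", P, v)

# Value iteration: many cheap sweeps, each a single truncated backup of V.
V, vi_iters = np.zeros(S), 0
while True:
    V_new = q_from_v(V).max(axis=1)
    vi_iters += 1
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Policy iteration: few outer steps, each with an exact policy evaluation (a linear solve).
pi, pi_iters = np.zeros(S, dtype=int), 0
while True:
    P_pi = P[pi, np.arange(S), :]            # transition matrix induced by pi
    r_pi = R[np.arange(S), pi]               # reward vector induced by pi
    v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    pi_new = q_from_v(v_pi).argmax(axis=1)
    pi_iters += 1
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print(f"value iteration sweeps: {vi_iters}, policy iteration steps: {pi_iters}")
```

Each policy-iteration step is dominated by the linear solve, which is why it is more expensive per iteration even though it typically needs far fewer iterations overall.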

REFERENCES

  1. Research papers
  2. Book
    • Artificial Intelligence: A Modern Approach, by Stuart J. Russell and Peter Norvig, Chapter 17: Making Complex Decisions
  3. Architecture

Value iteration (VI) is a truncated version of Policy iteration (PI).

PI has two steps:

  • The first step is policy evaluation: compute the state values of a given policy. This step essentially solves the Bellman equation $$v_\pi=r_\pi+\gamma P_\pi v_\pi,$$ written here in matrix-vector form (I assume the basics are already known).
  • The second step is policy improvement: at each state, select the action with the greatest action value (i.e., the greedy policy with respect to the evaluated values): $$\pi_{\text{new}}=\arg\max_\pi\left(r_\pi+\gamma P_\pi v_{\pi_{\text{old}}}\right),$$ where $v_{\pi_{\text{old}}}$ is the value function obtained in the evaluation step. (A short numpy sketch of both steps follows this list.)
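
For concreteness, here is a minimal numpy sketch of these two steps in the matrix-vector notation above; the array layout (P[a, s, s'] for transition probabilities, R[s, a] for expected rewards, pi[s] for a deterministic policy) is an assumption made for this sketch, not something fixed by the answer.

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma):
    """Solve v_pi = r_pi + gamma * P_pi v_pi exactly with a linear solve.

    P[a, s, s'] are transition probabilities, R[s, a] expected rewards,
    and pi[s] the action the (deterministic) policy takes in state s.
    """
    S = R.shape[0]
    P_pi = P[pi, np.arange(S), :]        # P_pi[s, s'] = P[pi[s], s, s']
    r_pi = R[np.arange(S), pi]           # r_pi[s] = R[s, pi[s]]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_improvement(v, P, R, gamma):
    """Greedy step: at each state pick the action with the largest action value."""
    Q = R + gamma * np.einsum("ast,t->sa", P, v)   # Q[s, a] by one-step look-ahead
    return Q.argmax(axis=1)
```

One outer iteration of PI is then policy_improvement(policy_evaluation(pi, P, R, gamma), P, R, gamma), repeated until the policy stops changing.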

The key point is: the policy evaluation step requires, in principle, an infinite number of iterations to solve the Bellman equation (i.e., to get the exact state values). In particular, we use the following iterative algorithm to solve the Bellman equation: $$v_\pi^{(k+1)}=r_\pi+\gamma P_\pi v_\pi^{(k)}, \quad k=1,2,\dots$$ We can prove that $v_\pi^{(k)}\rightarrow v_\pi$ as $k\rightarrow\infty$. There are three cases for executing this iterative algorithm:

  • Case 1: run an infinite number of iterations so that $v_\pi^{(\infty)}=v_\pi$. This is impossible in practice; of course, in practice we may run sufficiently many iterations until some metric (such as the difference between two consecutive values) is small enough.
  • Case 2: run just a single step, so that $v_\pi^{(2)}$ is used for the policy improvement step.
  • Case 3: run a few iterations (say, $N$ of them), so that $v_\pi^{(N+1)}$ is used for the policy improvement step.

Case 1 is the policy iteration algorithm; case 2 is the value iteration algorithm; case 3 is a more general truncated version. Such a truncated version does not require an infinite number of iterations and can converge faster than case 2, so it is often used in practice.
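
A minimal sketch of this general truncated scheme, under the same assumed P[a, s, s'] / R[s, a] layout as the sketch above: eval_sweeps=1 reduces to case 2 (value iteration), a very large eval_sweeps approximates case 1 (policy iteration), and moderate values give case 3.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma, eval_sweeps, tol=1e-8):
    """Case 3 as a sketch: a few evaluation sweeps per greedy improvement.

    P[a, s, s'] and R[s, a] describe an assumed tabular MDP. eval_sweeps=1
    recovers value iteration (case 2); a very large value approximates
    policy iteration (case 1).
    """
    S = R.shape[0]
    v = np.zeros(S)
    while True:
        # Policy improvement: act greedily with respect to the current estimate v.
        Q = R + gamma * np.einsum("ast,t->sa", P, v)
        pi = Q.argmax(axis=1)
        # Truncated evaluation: only eval_sweeps Bellman backups, warm-started at v.
        P_pi, r_pi = P[pi, np.arange(S), :], R[np.arange(S), pi]
        v_new = v
        for _ in range(eval_sweeps):
            v_new = r_pi + gamma * P_pi @ v_new
        if np.max(np.abs(v_new - v)) < tol:
            return pi, v_new
        v = v_new
```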