Policy Iteration (PI) is one of the algorithms for finding the optimal policy (MDP control).
Policy iteration is a model-based algorithm.
The complexity of the algorithm is
where
is the number of iterations needed for convergence. Theoretically, the maximum number of iterations is
.
The algorithm converges to the global optimum.
State-action value Q
State-action value of a policy
, is calculated by taking the specified action
immediately, then following the policy
Here,
is the reward function in MDP and
is the transition model.
Algorithm
- Set

- Initialize
randomly for all states 
- While
or
(L1-norm, measures if the policy changed for any state):
- Compute state-action value of a policy
, for all
and all

- Compute new policy
, for all
by choosing the action that returns the maximum state-action value for each specific state
Explanation
In each iteration, by definition we have
Proof