
Consider the grid world problem in RL. Formally, a policy in RL is defined as $\pi(a \mid s)$. If we are solving the grid world by policy iteration, then the following pseudocode is used:

[Image: pseudocode for policy iteration, i.e. iterative policy evaluation followed by greedy policy improvement]
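In rough Python-style pseudocode (my own sketch; `states`, `actions`, `p`, `r` and `gamma` are placeholders for a concrete grid-world MDP), the loop looks like this:

```python
# Sketch of policy iteration (placeholder MDP structures):
#   states, actions: iterables of states and actions
#   p[s][a]: dict mapping a next state s2 to the probability p(s2 | s, a)
#   r(s, a, s2): immediate reward, gamma: discount factor

def policy_iteration(states, actions, p, r, gamma, theta=1e-8):
    v = {s: 0.0 for s in states}
    pi = {s: next(iter(actions)) for s in states}  # arbitrary initial deterministic policy

    while True:
        # Policy evaluation: iterate v until it converges for the *fixed* policy pi
        while True:
            delta = 0.0
            for s in states:
                a = pi[s]  # the action prescribed by the current policy
                new_v = sum(prob * (r(s, a, s2) + gamma * v[s2])
                            for s2, prob in p[s][a].items())
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                break

        # Policy improvement: act greedily with respect to the current v
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions,
                        key=lambda a: sum(prob * (r(s, a, s2) + gamma * v[s2])
                                          for s2, prob in p[s][a].items()))
            if old_action != pi[s]:
                policy_stable = False

        if policy_stable:
            return pi, v
```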

My question is related to the policy improvement step. Specifically, I am trying to understand the following update rule.

$$\pi(s) \leftarrow \arg\max_a \sum_{s'} p(s' \mid s, a)\,[r(s, a, s') + \gamma v(s')]$$

I can think of two interpretations of this update rule:

  1. In this step, we check which action (say, going right for a particular state) has the highest expected return, assign going right a probability of 1, and assign the remaining actions a probability of 0. Thus, in the PE step, we will always go right for that state in all iterations, even if it is no longer the best action after a certain number of iterations.

  2. We keep the policy improvement step in mind and, while doing the PE step, we update $v(s)$ based on the action giving the highest expected return (say, for the first iteration $k=0$, going right gives the highest return, so we update based on that, while for $k=1$ going left gives the highest return, so we update our value based on that instead; the action thus changes depending on which one currently has the maximum expected return).

For me, the second interpretation is very similar to value iteration. So, which one is the correct interpretation of policy iteration?


1 Answer


A policy can be stochastic or deterministic. A deterministic policy is a function of the form $\pi_{\text{deterministic}}: S \rightarrow A$, i.e. a function from the set of states to the set of actions. A stochastic policy is a map of the form $\pi_{\text{stochastic}} : S \rightarrow P(A)$, where $P(A)$ is the set of probability distributions over the set of actions $A$: to each state $s_i$ the policy assigns one such distribution $p_{s_i}(A)$, so it can be described by $\{ p_{s_1}(A), p_{s_2}(A), \dots, p_{s_{|S|}}(A) \}$, where $|S|$ is the number of states of the environment. A deterministic policy can be interpreted as a stochastic policy that gives probability $1$ to one of the available actions (and $0$ to the remaining actions) in each state.
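As a concrete (and purely hypothetical) illustration in Python, a deterministic policy can be stored as a plain mapping from states to actions, while a stochastic policy maps each state to a probability distribution over actions; the deterministic case is just the distribution that puts all the mass on one action:

```python
# Hypothetical grid-world states and actions, only for illustration
states = ["s1", "s2"]
actions = ["up", "down", "left", "right"]

# Deterministic policy: one action per state (pi: S -> A)
pi_deterministic = {"s1": "right", "s2": "up"}

# Stochastic policy: a probability distribution over actions per state (pi: S -> P(A))
pi_stochastic = {
    "s1": {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7},
    "s2": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
}

# A deterministic policy viewed as a (degenerate) stochastic one
pi_degenerate = {s: {a: 1.0 if a == pi_deterministic[s] else 0.0 for a in actions}
                 for s in states}
```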

In the case of value iteration (VI) and policy iteration, the policy is deterministic, both in the policy evaluation (PE) and the policy improvement (PI) steps.

In the PE step, for a specific state $s$ (inside the for loop), you use $\pi(s)$: you assume that the action taken in state $s$ is the greedy action according to the current policy, because of the $\arg\max$ in the PI step. However, in general, $\pi_k(s_i) \neq \pi_k(s_j)$ for $i \neq j$, where $k$ is the current iteration (of PE and PI).

The update rule

$$\pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} p(s' \mid s, a)\,[r(s, a, s') + \gamma v_k(s')]$$

computes the action that is expected to give the highest return, given the current, fixed value function $v_k$. Have a look at the definition of expectation for discrete random variables: it is a weighted sum, just like the update rule above, where the weights are $p(s' \mid s, a)$.
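In code, that weighted sum is just the expected one-step return of each action, and the $\arg\max$ picks the action with the largest one. A minimal sketch, assuming (hypothetically) that `p[s][a]` is a dict mapping each next state `s2` to $p(s' \mid s, a)$, `r(s, a, s2)` returns the reward, and `v` is the current value estimate:

```python
def improved_action(s, actions, p, r, gamma, v):
    """Greedy (improved) action for state s, given the fixed value function v."""
    def expected_return(a):
        # Weighted sum over next states: an expectation with weights p(s' | s, a)
        return sum(prob * (r(s, a, s2) + gamma * v[s2])
                   for s2, prob in p[s][a].items())
    return max(actions, key=expected_return)
```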

Value iteration can be seen as a truncated version of policy iteration. In VI, rather than fully evaluating a fixed policy and then performing a separate PI step for each state, each evaluation sweep directly backs up the value of the action that is expected to give the highest return (the $\max$ over actions is embedded in the value update), and the greedy policy is extracted only once, at the end. In policy iteration, you alternate between PE and PI, and at each PI step you update the policy.
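For contrast, here is a sketch of a single value-iteration sweep, where the $\max$ over actions is folded directly into the value update and no explicit policy is maintained during the sweeps (same hypothetical `p`, `r`, `gamma` and `v` as above):

```python
def value_iteration_sweep(states, actions, p, r, gamma, v):
    """One sweep of value iteration: back up each state with the best action's expected return."""
    delta = 0.0
    for s in states:
        best = max(sum(prob * (r(s, a, s2) + gamma * v[s2])
                       for s2, prob in p[s][a].items())
                   for a in actions)
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    return delta  # used to check convergence; the greedy policy is extracted afterwards
```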
