
I'm going through Sutton and Barto's book Reinforcement Learning: An Introduction and I'm trying to understand the proof of the Policy Improvement Theorem, presented on page 78 of the physical book.

The theorem goes as follows:

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s\in S$,

$q_{\pi}(s,\pi'(s))\geq v_{\pi}(s)$.

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain a greater or equal expected return from all states $s\in S$:

$v_{\pi'}(s)\geq v_{\pi}(s)$.

I take it that for the proof, the policy $\pi'$ is identical to $\pi$ except for one particular state $s$ (at each time step) for which we have $\pi'(s)=a\neq \pi(s)$, as suggested by @PraveenPalanisamy in his answer here.

The proof starts from the statement of the theorem: $v_{\pi}(s)\leq q_{\pi}(s,\pi'(s))$.

Then $q_{\pi}(s,\pi'(s))$ is expanded as $\mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s,A_{t}=\pi'(s)]=\mathbb{E}_{\pi'}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s]$.

I don't understand how we got rid of the condition $A_{t}=\pi'(s)$. I don't think it's explained by adding the subscript $\pi'$ to the expectation, because that subscript should appear by definition anyway: for the following time steps we follow $\pi$, which is identical to $\pi'$ (except at $s$).

Daviiid

1 Answer


I don't understand how we got rid of the condition $A_{t}=\pi'(s)$.

We don't really; it is just moved into the subscript $\pi'$ in $\mathbb{E}_{\pi'}[\cdot]$, which means the same thing here: the next action is chosen according to the modified policy $\pi'$. Concretely, $\mathbb{E}_{\pi'}[X \mid S_{t}=s] = \sum_{a}\pi'(a \mid s)\,\mathbb{E}[X \mid S_{t}=s, A_{t}=a]$, and because $\pi'$ is deterministic, $\pi'(a \mid s)$ puts all its weight on the single action $a=\pi'(s)$, so the two expectations are identical. Moving the condition around is part of the proof's strategy, which eventually expresses the expectation in a familiar way so that we end up with something that matches the definition of $v_{\pi'}(s)$.
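To make this concrete, here is a small numerical sketch on a made-up two-state, two-action MDP (all transition probabilities, rewards, and the policy are invented for illustration). It checks that conditioning on $A_t=\pi'(s)$ and taking the expectation under the deterministic policy $\pi'$ give the same number, and that the theorem's conclusion $v_{\pi'}(s)\geq v_{\pi}(s)$ then holds:

```python
import numpy as np

# A tiny made-up MDP: 2 states, 2 actions (numbers are for illustration only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],                 # R[s, a] expected immediate reward
              [0.5, 2.0]])
n_states, n_actions = R.shape

def v_of(policy):
    # Solve the Bellman equations v = R_pi + gamma * P_pi v exactly.
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, policy],
                           R[idx, policy])

pi = np.array([0, 0])                     # original deterministic policy
v_pi = v_of(pi)
q_pi = R + gamma * P @ v_pi               # q_pi[s, a]
pi_prime = q_pi.argmax(axis=1)            # greedy w.r.t. q_pi, so the premise
                                          # q_pi(s, pi'(s)) >= v_pi(s) holds

for s in range(n_states):
    a = pi_prime[s]
    # E[R_{t+1} + gamma v_pi(S_{t+1}) | S_t = s, A_t = pi'(s)]
    cond_on_action = R[s, a] + gamma * P[s, a] @ v_pi
    # E_{pi'}[R_{t+1} + gamma v_pi(S_{t+1}) | S_t = s]: weight each action by
    # pi'(b|s), which is 1 for b = pi'(s) and 0 otherwise (deterministic pi').
    under_pi_prime = sum((b == a) * (R[s, b] + gamma * P[s, b] @ v_pi)
                         for b in range(n_actions))
    assert np.isclose(cond_on_action, under_pi_prime)

# Premise of the theorem holds for the greedy pi' ...
assert (q_pi[np.arange(n_states), pi_prime] >= v_pi - 1e-12).all()
# ... and so does its conclusion: v_{pi'}(s) >= v_pi(s) for every state.
assert (v_of(pi_prime) >= v_pi - 1e-12).all()
```

The first pair of assertions is the point of the question: for a deterministic $\pi'$, the subscript $\pi'$ on the expectation carries exactly the same information as the condition $A_t=\pi'(s)$.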

Neil Slater
  • 66