
I'm going through Sutton and Barto's book Reinforcement Learning: An Introduction and I'm trying to understand the proof of the Policy Improvement Theorem, presented on page 78 of the physical book.

The theorem goes as follows:

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s\in S$,

$q_{\pi}(s,\pi'(s))\geq v_{\pi}(s)$.

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain a greater or equal expected return from all states $s\in S$:

$v_{\pi'}(s)\geq v_{\pi}(s)$.

I take it that for the proof, the policy $\pi'$ is identical to $\pi$ except for one particular state $s$ (at each time step) for which we have $\pi'(s)=a\neq \pi(s)$, as suggested by @PraveenPalanisamy in his answer here.

The proof starts from the statement of the theorem: $v_{\pi}(s)\leq q_{\pi}(s,\pi'(s))$.

Then $q_{\pi}(s,\pi'(s))$ is expanded as $\mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s,A_{t}=\pi'(s)]=\mathbb{E}_{\pi'}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s]$.

I don't understand how we got rid of the condition $A_{t}=\pi'(s)$. I don't think it's explained by adding the subscript $\pi'$ to the expectation, because that subscript should appear by definition anyway: for the following time steps we follow $\pi$, which is identical to $\pi'$ (except at $s$).

Daviiid

1 Answer


I don't understand how we got rid of the condition $A_{t}=\pi'(s)$.

We don't really; it is just moved into the subscript $\pi'$ in $\mathbb{E}_{\pi'}[\cdot]$, which means the same thing here: the next action is chosen according to the modified policy $\pi'$. Concretely, $\mathbb{E}_{\pi'}[X \mid S_{t}=s] = \sum_{a}\pi'(a \mid s)\,\mathbb{E}[X \mid S_{t}=s, A_{t}=a]$, and because $\pi'$ is deterministic, $\pi'(a \mid s)$ puts all its weight on the single action $a=\pi'(s)$, so the two expectations are identical. Moving the condition around is part of the proof's strategy, which eventually expresses the expectation in a familiar way so that we end up with something that matches the definition of $v_{\pi'}(s)$.
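To make this concrete, here is a small numerical sketch on a made-up two-state, two-action MDP (all transition probabilities, rewards, and the policy are invented for illustration). It checks that conditioning on $A_t=\pi'(s)$ and taking the expectation under the deterministic policy $\pi'$ give the same number, and that the theorem's conclusion $v_{\pi'}(s)\geq v_{\pi}(s)$ then holds:

```python
import numpy as np

# A tiny made-up MDP: 2 states, 2 actions (numbers are for illustration only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],                 # R[s, a] expected immediate reward
              [0.5, 2.0]])
n_states, n_actions = R.shape

def v_of(policy):
    # Solve the Bellman equations v = R_pi + gamma * P_pi v exactly.
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, policy],
                           R[idx, policy])

pi = np.array([0, 0])                     # original deterministic policy
v_pi = v_of(pi)
q_pi = R + gamma * P @ v_pi               # q_pi[s, a]
pi_prime = q_pi.argmax(axis=1)            # greedy w.r.t. q_pi, so the premise
                                          # q_pi(s, pi'(s)) >= v_pi(s) holds

for s in range(n_states):
    a = pi_prime[s]
    # E[R_{t+1} + gamma v_pi(S_{t+1}) | S_t = s, A_t = pi'(s)]
    cond_on_action = R[s, a] + gamma * P[s, a] @ v_pi
    # E_{pi'}[R_{t+1} + gamma v_pi(S_{t+1}) | S_t = s]: weight each action by
    # pi'(b|s), which is 1 for b = pi'(s) and 0 otherwise (deterministic pi').
    under_pi_prime = sum((b == a) * (R[s, b] + gamma * P[s, b] @ v_pi)
                         for b in range(n_actions))
    assert np.isclose(cond_on_action, under_pi_prime)

# Premise of the theorem holds for the greedy pi' ...
assert (q_pi[np.arange(n_states), pi_prime] >= v_pi - 1e-12).all()
# ... and so does its conclusion: v_{pi'}(s) >= v_pi(s) for every state.
assert (v_of(pi_prime) >= v_pi - 1e-12).all()
```

The first pair of assertions is the point of the question: for a deterministic $\pi'$, the subscript $\pi'$ on the expectation carries exactly the same information as the condition $A_t=\pi'(s)$.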

Neil Slater
  • 66