
This is for Exercise 3.11 from Sutton and Barto's book.

If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

The question already has an answer here. My approach is slightly different, as I did not know the tower rule (which David's answer is based on) when I started this exercise. I would appreciate some input on whether my solution is valid.

According to the rules of conditional probability with multiple conditions (worked out in detail in this answer), we have

$$ P(A|B,\theta) = \frac{P(A,B | \theta)}{P(B|\theta)}$$

Applying this to the four-argument function with $A = (s', r)$, $B = a$ and $\theta = s$, and noting that $p(s',r|s,a) = p(s',r|a,s)$, we have:

$$ p(s',r|s,a) = \frac{p(s',r,a|s)}{p(a|s)} $$

We recognize that the denominator on the RHS above is the policy $\pi(a|s)$. Substituting and rearranging, we get: $$ p(s',r,a|s) = p(s',r|s,a) \cdot \pi(a|s) $$

We get $p(r|s)$ by summing the joint distribution $p(s',r,a|s)$ over $s'$ and $a$:

$$ p(r|s) = \sum_{s' \in S} \sum_{a \in A} p(s',r,a|s) = \sum_{s' \in S} \sum_{a \in A} p(s',r|s,a) \cdot \pi(a|s) $$

We compute the expectation as $$ \mathbf{E}[R_{t+1}|S_t = s] = \sum_{r \in R} r \cdot p(r|s) $$

The time subscripts on $R_{t+1}$ and $S_t$ are mainly a matter of convention: $p(r|s)$ is shorthand for $P(R_{t+1} = r | S_t = s)$.

Substituting for $p(r|s)$ from the previous equation gives: $$ \mathbf{E}[R_{t+1} | S_t = s] = \sum_{r \in R} r \sum_{s' \in S} \sum_{a \in A} p(s',r|s,a) \cdot \pi(a|s) $$
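To sanity-check the marginalization and the expectation numerically, here is a minimal Python sketch for a single fixed state $s$. The action names, next states, rewards and probabilities are made-up toy values (they are not from the book), and the dictionaries `pi` and `p` are just one hypothetical way to store $\pi(a|s)$ and $p(s',r|s,a)$.

```python
from collections import defaultdict

# Hypothetical toy numbers for one fixed state s (assumed, not from the book).
# pi[a] holds pi(a|s); p[a][(s_next, r)] holds p(s', r | s, a).
pi = {"left": 0.25, "right": 0.75}
p = {
    "left": {("s1", 0.0): 0.5, ("s2", 1.0): 0.5},
    "right": {("s1", 1.0): 0.25, ("s2", 2.0): 0.75},
}

# Marginal p(r|s): sum p(s', r | s, a) * pi(a|s) over s' and a.
p_r = defaultdict(float)
for a, outcomes in p.items():
    for (s_next, r), prob in outcomes.items():
        p_r[r] += prob * pi[a]

# Expectation: sum over r of r * p(r|s).
expected_reward = sum(r * prob for r, prob in p_r.items())

print(dict(p_r))
print(expected_reward)
```

With these toy values the marginal $p(r|s)$ sums to 1, which is a quick check that the joint $p(s',r,a|s)$ was built correctly.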

ahron

1 Answer


While the tower rule is a valid approach, your use of basic probability rules suffices here. Because the sums are finite, you can reorder them to match the standard result, with the outer sum weighted by the policy: $$ \mathbb{E}[R_{t+1} | S_t = s] = \sum_{a \in A} \pi(a|s) \sum_{r \in R} \sum_{s' \in S} r \, p(s',r|s,a) $$

Therefore, applying the tower rule is equivalent to your approach, which combines the chain rule of conditional probability with the definition of expectation.
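As a quick check that the two orderings agree, here is a small Python sketch for one fixed state $s$. All numbers and names (`pi`, `p`, `r_sa`) are hypothetical toy values chosen for illustration, not anything from the book.

```python
# Hypothetical toy numbers for one fixed state s (assumed for illustration).
# pi[a] holds pi(a|s); p[a][(s_next, r)] holds p(s', r | s, a).
pi = {"left": 0.25, "right": 0.75}
p = {
    "left": {("s1", 0.0): 0.5, ("s2", 1.0): 0.5},
    "right": {("s1", 1.0): 0.25, ("s2", 2.0): 0.75},
}

# Ordering from the question: outer sum over r, inner sums over s' and a.
rewards = sorted({r for outcomes in p.values() for (_, r) in outcomes})
by_reward = sum(
    r * sum(
        prob * pi[a]
        for a, outcomes in p.items()
        for (_, r2), prob in outcomes.items()
        if r2 == r
    )
    for r in rewards
)

# Policy-weighted ordering: first the per-action expected reward
# r(s, a) = sum over r and s' of r * p(s', r | s, a), then average over a.
r_sa = {
    a: sum(r * prob for (_, r), prob in outcomes.items())
    for a, outcomes in p.items()
}
by_action = sum(pi[a] * r_sa[a] for a in pi)

print(by_reward, by_action)
```

The intermediate `r_sa` is the conditional expectation $\mathbb{E}[R_{t+1} | S_t = s, A_t = a]$, so the second computation is the tower rule written out explicitly.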

cinch