
Here are two relations for value functions:

Eq.1: $v_{\pi}(s)=E_{\pi}[q_{\pi}(S_t, A_t)|S_t=s]$

Eq.2: $q_{\pi}(s,a)=E[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_t=s, A_t=a]$

Question: Why is there subscript $\pi$ in the first expectation equation, but not in the second one? What is the logic in including/not including subscript $\pi$ in "E"?

DSPinfinity

2 Answers


The subscript just indicates which distribution the expectation is taken over. For consistency it can be included everywhere, but when the distribution is obvious it is often omitted, as in $E[x]$. In that case it is clear; however, you can imagine that $x$ depends on some latent variable and the expectation is taken not over $x$ itself but over that latent dependence, e.g. $E_z[x(z)] = E_z[x]$.

Regarding your case, I would assume that in the first equation the subscript emphasises that the expectation is taken over the actions chosen by following the policy $\pi$. In the second equation the action is already given, so the variable of interest can be inferred from the context.
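To make the two expectations concrete, here is a minimal tabular sketch (the names `policy`, `dynamics`, `gamma`, `v_from_q`, and `q_from_v` are my own, not from the question): Eq.1 averages $q_{\pi}$ over the actions drawn from $\pi(\cdot|s)$, which is what the $\pi$ subscript marks, while Eq.2 averages only over the environment's transition distribution $p(s', r \mid s, a)$ because the action is already fixed.

```python
# Hypothetical tabular representation (not from the question):
#   policy[s][a]   = pi(a|s)
#   dynamics[s][a] = list of (prob, next_state, reward) tuples, i.e. p(s', r | s, a)
gamma = 0.9

def v_from_q(s, q, policy):
    # Eq.1: expectation over A_t ~ pi(.|s), hence the pi subscript on E.
    return sum(policy[s][a] * q[s][a] for a in policy[s])

def q_from_v(s, a, v, dynamics):
    # Eq.2: the action is fixed, so the only remaining randomness is the
    # environment's p(s', r | s, a); no policy subscript is needed on E.
    return sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[s][a])
```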

You can find a very similar question here. Hope it helps!

vl_knd

Your second equation is defined on page 78, and on the same page a step in the lower part of the derivation contains the answer to your confusion.

$q_{\pi}(s, {\pi'}(s)) \\= E[R_{t+1} + \gamma v_{\pi}(S_{t+1})|S_t=s,A_t={\pi'}(s)] \\= E_{\pi'}[R_{t+1}+ \gamma v_{\pi}(S_{t+1})|S_t=s]$

The key is that when the expectation is conditioned on a specific action at time $t$, the remaining random variables, the capitalized reward $R_{t+1}$ and next state $S_{t+1}$, are determined solely by the stochastic environment, not by any agent's policy. If it is not conditioned on a specific action at time $t$, then, as the backup diagram in your other post shows, $R_{t+1}$ is determined not only by the stochastic environment but also by which action $a$ is taken at time $t$, sampled from the distribution of the random action $A_t$ controlled by the policy.
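A minimal sampling sketch of the same distinction, assuming hypothetical helpers `sample_env(s, a)` (one draw from $p(s', r \mid s, a)$) and `sample_action(s)` (one draw from $\pi'(\cdot|s)$), neither of which appears in the book or the original answer:

```python
GAMMA = 0.9

def sample_target_given_action(s, a, sample_env, v):
    # Conditioned on A_t = a: the only randomness in R_{t+1} and S_{t+1}
    # comes from the environment, so the expectation needs no policy subscript.
    s_next, r = sample_env(s, a)
    return r + GAMMA * v[s_next]

def sample_target_under_policy(s, sample_action, sample_env, v):
    # Not conditioned on the action: A_t is first drawn from pi'(.|s), so the
    # expectation is also over the policy, hence the pi' subscript on E.
    a = sample_action(s)
    s_next, r = sample_env(s, a)
    return r + GAMMA * v[s_next]
```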

cinch