Questions tagged [expectation]

For questions related to the mathematical concept of "expectation" or "expected value".

27 questions
8
votes
2 answers

Why does the state-action value function, defined as an expected value of the reward and state value function, not need to follow a policy?

I often see that the state-action value function is expressed as: $$q_{\pi}(s,a)=\color{red}{\mathbb{E}_{\pi}}[R_{t+1}+\gamma G_{t+1} | S_t=s, A_t = a] = \color{blue}{\mathbb{E}}[R_{t+1}+\gamma v_{\pi}(s') |S_t = s, A_t =a]$$ Why does expressing the…
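A quick way to see why the second expectation can drop the $\pi$ subscript (a sketch using the four-argument dynamics $p(s',r \mid s,a)$ from Sutton & Barto, 2nd edition): once $A_t = a$ is fixed, the next reward and next state depend only on the environment dynamics, not on the policy, so
$$\mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_{\pi}(s')\bigr].$$
The policy only re-enters inside $v_\pi(s')$, where future actions are chosen.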
6
votes
1 answer

If the current state is $S_t$ and the actions are chosen according to $\pi$, what is the expectation of $R_{t+1}$ in terms of $\pi$ and $p$?

I'm trying to solve exercise 3.11 from Sutton and Barto's book (2nd edition). Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms…
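For orientation, marginalising first over the action with the policy and then over the dynamics gives one standard way to write such an expectation (a sketch of the general pattern, not necessarily the book's intended form of the answer):
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r.$$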
5
votes
1 answer

What does the argmax of the expectation of the log likelihood mean?

What does the following equation mean? What does each part of the formula represent or mean? $$\theta^* = \underset {\theta}{\arg \max} \Bbb E_{x \sim p_{data}} \log {p_{model}(x|\theta) }$$
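Read left to right, the formula says: choose the parameters $\theta$ that maximise the expected log-likelihood the model assigns to data drawn from the data-generating distribution. In practice the expectation over $p_{data}$ is approximated by an average over a training set of $m$ samples (a standard empirical approximation, stated here for context):
$$\theta^* \approx \underset{\theta}{\arg\max}\; \frac{1}{m} \sum_{i=1}^{m} \log p_{model}\bigl(x^{(i)} \mid \theta\bigr).$$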
5
votes
1 answer

Why is the mean used to compute the expectation in the GAN loss?

From Goodfellow et al. (2014), we have the adversarial loss: $$ \min_G \, \max_D V (D, G) = \mathbb{E}_{x \sim p_{data}(x)} \, [\log \, D(x)] + \, \mathbb{E}_{z \sim p_z(z)} \, [\log \, (1 - D(G(z)))] \, \text{.} \quad$$ In practice, the expectation is…
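The short version of the usual answer: an expectation under a distribution you can only sample from is estimated by a Monte Carlo average, and the minibatch mean is exactly that estimator. A minimal NumPy sketch (the discriminator and the data distribution here are hypothetical placeholders, not Goodfellow et al.'s actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x):
    # Hypothetical stand-in for D(x): any function returning values in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# A minibatch of m "real" samples drawn from an assumed data distribution.
m = 128
x_batch = rng.normal(loc=1.0, scale=0.5, size=m)

# Monte Carlo estimate: E_{x ~ p_data}[log D(x)] is approximated by the
# sample mean of log D(x) over the minibatch.
expectation_estimate = np.mean(np.log(discriminator(x_batch)))
print(expectation_estimate)
```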
5
votes
2 answers

What is the difference between return and expected return?

At a time step $t$, for a state $S_{t}$, the return is defined as the discounted cumulative reward from that time step $t$. If an agent is following a policy (which in itself is a probability distribution of choosing a next state $S_{t+1}$ from…
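For reference, in Sutton & Barto's notation the distinction is between a random quantity and its expectation: the return is the (random) discounted sum of rewards actually experienced, while value functions are its expected value under a policy:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,].$$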
4
votes
2 answers

$E_{\pi}[R_{t+1}|S_t=s,A_t=a] = E[R_{t+1}|S_t=s,A_t=a]$?

I would like to solve the first question of Exercise 3.19 from Sutton and Barto: Exercise 3.19 The value of an action, $q_{\pi}(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in…
4
votes
1 answer

How is the state-value function expressed as a product of sums?

The state-value function for a given policy $\pi$ is given by $$\begin{align} V^{\pi}(s) &=E_{\pi}\left\{r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\cdots \mid s_{t}=s\right\} \\ &=E_{\pi}\left\{r_{t+1}+\gamma V^{\pi}\left(s_{t+1}\right) \mid…
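For context, completing the truncated recursion and expanding the expectation over actions and next states gives the "product of sums" form; written here in the first-edition transition notation $P^a_{ss'}$, $R^a_{ss'}$, since the excerpt follows that convention:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'}\bigl[R^{a}_{ss'} + \gamma V^{\pi}(s')\bigr].$$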
4
votes
2 answers

Why is $G_{t+1}$ replaced with $v_*(S_{t+1})$ in the Bellman optimality equation?

In equation 3.17 of Sutton and Barto's book: $$q_*(s, a)=\mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$ $G_{t+1}$ here has been replaced with $v_*(S_{t+1})$, but no reason has been provided for why this step has been taken. Can…
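The step usually follows from the law of total expectation together with the Markov property: condition the tail return on the next state, note that the inner conditional expectation depends only on $S_{t+1}$, and use the fact that under the optimal policy from $t+1$ onward that inner expectation is the optimal state value. A sketch of the argument:
$$\mathbb{E}[G_{t+1} \mid S_t = s, A_t = a] = \mathbb{E}\bigl[\,\mathbb{E}[G_{t+1} \mid S_{t+1}]\,\big|\, S_t = s, A_t = a\bigr] = \mathbb{E}[v_*(S_{t+1}) \mid S_t = s, A_t = a].$$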
4
votes
1 answer

What does $x,y \sim \hat{p}_{data}$ mean in the Deep Learning book by Goodfellow?

In chapter 5 of the Deep Learning book by Ian Goodfellow, some of the notation in the loss function below really confuses me. My understanding is that $x,y \sim p_{data}$ means a sample $(x, y)$ sampled from the original dataset distribution (or $y$ is the…
3
votes
1 answer

What is wrong with equation 7.3 in Sutton & Barto's book?

Equation 7.3 of Sutton & Barto's book: $$\max_s\left|\mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s)\right| \le \gamma^n \max_s\left|V_{t+n-1}(s) - v_\pi(s)\right|$$ $$\text{where } G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} +…
3
votes
1 answer

How does $\mathbb{E}$ suddenly change to $\mathbb{E}_{\pi'}$ in this equation?

In Sutton-Barto's book on page 63 (81 of the pdf): $$\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s,A_t=\pi'(s)] = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_{t} = s]$$ How does $\mathbb{E}$ suddenly change to…
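A compact way to see the step (assuming, as in that part of the book, that $\pi'$ is deterministic, i.e. it puts all of its probability on the single action $\pi'(s)$): expanding the policy expectation over actions,
$$\mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = \sum_a \pi'(a \mid s)\,\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a],$$
and when $\pi'(a \mid s) = 1$ only for $a = \pi'(s)$, the sum collapses to the single conditional expectation on the left-hand side of the question.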
3
votes
1 answer

What is meant by the expected BLEU cost when training with BLEU and SIMILE?

Recently I was reading a paper based on a new evaluation metric, SIMILE. In one section, a validation loss comparison is made between SIMILE and BLEU: the plot shows the expected BLEU cost when training with BLEU and with SIMILE. What I'm unable to…
3
votes
1 answer

Shouldn't the expected return be calculated for some faraway time in the future, $t+n$, instead of the current time $t$?

I am learning RL for the first time. It may be naive, but I find this idea a bit odd to grasp: if the goal of RL is to maximize the expected return, then shouldn't the expected return be calculated for some faraway time in the future ($t+n$)…
2
votes
1 answer

Expected reward $R_{t+1}$ in state $S_t$ as a function of policy $\pi$ and $p(s',r|s,a)$

This is for Exercise 3.11 from Sutton and Barto's book. If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$…
2
votes
2 answers

How is per-decision importance sampling derived in Sutton & Barto's book?

In per-decision importance sampling given in Sutton & Barto's book: Eq 5.12 $\rho_{t:T-1}R_{t+k} =…
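For context, the key fact behind Eq. 5.12 is that each importance ratio taken at or after time $t+k$ is independent of $R_{t+k}$ (given the trajectory up to that point) and has expectation one under the behaviour policy $b$, so those factors drop out of the expectation. A sketch of the usual argument, not the book's full derivation:
$$\mathbb{E}_b\!\left[\frac{\pi(A_j \mid S_j)}{b(A_j \mid S_j)}\right] = \sum_a b(a \mid S_j)\,\frac{\pi(a \mid S_j)}{b(a \mid S_j)} = \sum_a \pi(a \mid S_j) = 1, \qquad\text{hence}\qquad \mathbb{E}_b[\rho_{t:T-1} R_{t+k}] = \mathbb{E}_b[\rho_{t:t+k-1} R_{t+k}].$$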