Questions tagged [sutton-barto]

For questions related to the book "Reinforcement Learning: An Introduction" (by Richard S. Sutton and Andrew G. Barto).

"Reinforcement Learning: An Introduction" (by Andrew Barto and Richard S. Sutton) is often considered or cited as the most comprehensive introductory manual to the field of RL, by two of the greatest contributors to the field.

Two editions have been published so far: the first in 1998 and the second in 2018. Material related to the book (including some drafts) is available at the following URL: http://incompleteideas.net/book/.

122 questions
24 votes, 2 answers

What is the difference between reinforcement learning and optimal control?

Coming from a process (optimal) control background, I have begun studying the field of deep reinforcement learning. Sutton & Barto (2015) state that particularly important (to the writing of the text) have been the contributions establishing and…
18 votes, 4 answers

Why does the discount rate in the REINFORCE algorithm appear twice?

I was reading the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (complete draft, November 5, 2017). On page 271, the pseudo-code for the episodic Monte-Carlo Policy-Gradient Method is presented. Looking at…
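For context, the update in the book's pseudo-code has the following shape (a sketch of the 2nd-edition pseudo-code; notation may differ slightly from the draft quoted in the question):

$$\theta \leftarrow \theta + \alpha \, \gamma^{t} \, G_t \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta), \qquad G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k,$$

so the discount appears once inside the return $G_t$ and again as the explicit $\gamma^t$ factor in front of the gradient.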
12 votes, 4 answers

Counterexamples to the reward hypothesis

In Sutton and Barto's RL book, the reward hypothesis is stated as follows: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called…
11 votes, 2 answers

How do we prove the n-step return error reduction property?

In section 7.1 (about n-step bootstrapping) of the book Reinforcement Learning: An Introduction (2nd edition), by Richard S. Sutton and Andrew G. Barto, the authors write about what they call the "n-step return error reduction property". But they…
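For reference, the property in question (equation 7.3 in the 2nd edition, if I recall the numbering correctly) states that

$$\max_s \Big| \mathbb{E}_\pi\big[G_{t:t+n} \mid S_t = s\big] - v_\pi(s) \Big| \;\le\; \gamma^{n} \max_s \Big| V_{t+n-1}(s) - v_\pi(s) \Big|,$$

i.e. the worst-case error of the expected $n$-step return is at most $\gamma^n$ times the worst-case error of the value estimate it bootstraps from.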
8 votes, 1 answer

How is the policy gradient calculated in REINFORCE?

Reading Sutton and Barto, I see the following in describing policy gradients: How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the…
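For orientation, the policy-gradient expression at the heart of REINFORCE, as given in the book, is

$$\nabla_{\theta} J(\theta) \propto \mathbb{E}_\pi\big[ G_t \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta) \big],$$

so the gradient is taken with respect to the policy parameters $\theta$ (not the action itself) and is evaluated at the action $A_t$ that was actually taken at time $t$, weighted by the return $G_t$.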
7 votes, 2 answers

How can the importance sampling ratio be different than zero when the target policy is deterministic?

In the book Reinforcement Learning: An Introduction (2nd edition) Sutton and Barto define at page 104 (p. 126 of the pdf), equation (5.3), the importance sampling ratio, $\rho _{t:T-1}$, as follows: $$\rho…
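The full definition being quoted (equation 5.3) is

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

where $\pi$ is the target policy and $b$ is the behaviour policy; with a deterministic $\pi$, each factor is either $0$ or $1/b(A_k \mid S_k)$, which is what the question turns on.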
7 votes, 2 answers

In Value Iteration, why can we initialize the value function arbitrarily?

I have not been able to find a good explanation of this, other than statements that the algorithm is guaranteed to converge with arbitrary choices for initial values in each state. Is this something to do with the Bellman optimality constraint…
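A minimal sketch of value iteration on a made-up 2-state MDP (the transition and reward arrays below are invented purely for illustration) shows the point empirically: the Bellman optimality operator is a $\gamma$-contraction, so very different initial value functions converge to the same fixed point.

import numpy as np

# Hypothetical MDP: 2 states, 2 actions (numbers invented for illustration).
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(V0, tol=1e-10):
    V = np.asarray(V0, dtype=float).copy()
    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
        Q = R + gamma * (P @ V)        # action values, shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Two very different initialisations reach the same optimal value function.
print(value_iteration(np.zeros(2)))
print(value_iteration(np.array([100.0, -50.0])))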
7 votes, 2 answers

In the n-step off-policy SARSA update, why do we multiply the entire update by $\rho$?

In Sutton & Barto's book (2nd ed), page 149, there is equation 7.11, which I am having a hard time understanding. I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
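For reference (without reproducing the exact subscript range on the importance-sampling ratio from memory), the update being discussed has the general shape

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, \rho \, \big[ G_{t:t+n} - Q(S_t, A_t) \big],$$

i.e. the ratio $\rho$ multiplies the entire TD error, including the $-Q(S_t, A_t)$ term, rather than only the return $G_{t:t+n}$, which is precisely what the question asks about.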
6 votes, 5 answers

How do we compute the table for $p(s',r|s,a)$ (exercise 3.5 in Sutton & Barto's book)?

I am trying to study the book Reinforcement Learning: An Introduction (Sutton & Barto, 2018). In chapter 3.1, the authors state the following exercise: Exercise 3.5 Give a table analogous to that in Example 3.3, but for $p(s',r|s,a)$. It should have…
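As a structural aid only (the states, actions, and numbers below are hypothetical and are not the ones from Example 3.3), such a table can be represented as a nested mapping keyed by $(s, a)$, where each row sums to one over the $(s', r)$ pairs:

# Hypothetical dynamics table: (s, a) -> {(s', r): probability}.
p = {
    ("s0", "a0"): {("s0", 1.0): 0.7, ("s1", 0.0): 0.3},
    ("s0", "a1"): {("s1", 2.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
}

# Sanity check implied by the exercise: each (s, a) row must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-12, (s, a)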
6 votes, 1 answer

If $\gamma \in (0,1)$, what is the on-policy state distribution for episodic tasks?

In Reinforcement Learning: An Introduction, section 9.2 (page 199), Sutton and Barto describe the on-policy distribution in episodic tasks, with $\gamma =1$, as being \begin{equation} \mu(s) = \frac{\eta(s)}{\sum_{k \in S}…
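For context, in the $\gamma = 1$ case the book defines $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$ with $\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) \, p(s \mid \bar{s}, a)$, where $h(s)$ is the probability that an episode starts in $s$. If I recall correctly, the book remarks that discounting can be treated as a form of probabilistic termination, which amounts to multiplying the second term by $\gamma$:

$$\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) \, p(s \mid \bar{s}, a).$$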
6 votes, 1 answer

If the current state is $S_t$ and the actions are chosen according to $\pi$, what is the expectation of $R_{t+1}$ in terms of $\pi$ and $p$?

I'm trying to solve exercise 3.11 from Sutton and Barto's book (2nd edition): Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms…
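One way to write such an expectation with the four-argument dynamics function (not necessarily the exact form the exercise expects) is

$$\mathbb{E}_\pi\big[R_{t+1} \mid S_t = s\big] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r r \, p(s', r \mid s, a).$$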
5 votes, 2 answers

Why is $\sum_{s} \eta(s)$ a constant of proportionality in the proof of the policy gradient theorem?

In Sutton and Barto's book (http://incompleteideas.net/book/bookdraft2017nov5.pdf), a proof of the policy gradient theorem is provided on pg. 269 for an episodic case and a start state policy objective function (see picture below, last 3…
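For context, the episodic policy gradient theorem in the book reads

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \, \nabla_{\theta} \pi(a \mid s, \theta),$$

with $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$, so the suppressed constant of proportionality is $\sum_s \eta(s)$ (the average episode length), which is what the question is asking about.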
5 votes, 1 answer

What is the difference between an on-policy distribution and state visitation frequency?

On-policy distribution is defined as follows in Sutton and Barto: On the other hand, state visitation frequency is defined as follows in Trust Region Policy Optimization: $$\rho_{\pi}(s) = \sum_{t=0}^{T} \gamma^t P(s_t=s|\pi)$$ Question: What is…
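Roughly speaking (this is only a sketch of the relationship, not a full answer), the two objects differ by normalisation: the TRPO quantity $\rho_\pi(s)$ is an un-normalised (discounted) expected visitation count, whereas the book's on-policy distribution divides the visitation measure $\eta(s)$ by its total,

$$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')},$$

so that it is a proper probability distribution over states.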
5 votes, 2 answers

Why does the definition of the reward function $r(s, a, s')$ involve the term $p(s' \mid s, a)$?

Sutton and Barto define the state–action–next-state reward function, $r(s, a, s')$, as follows (equation 3.6, p. 49) $$ r(s, a, s^{\prime}) \doteq \mathbb{E}\left[R_{t} \mid S_{t-1}=s, A_{t-1}=a, S_{t}=s^{\prime}\right]=\sum_{r \in \mathcal{R}} r…
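The full definition (equation 3.6 in the 2nd edition) reads

$$r(s, a, s') \doteq \mathbb{E}\big[R_t \mid S_{t-1}=s, A_{t-1}=a, S_t=s'\big] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)},$$

and the $p(s' \mid s, a)$ term arises because conditioning on $S_t = s'$ turns the joint dynamics $p(s', r \mid s, a)$ into a conditional probability, i.e. it must be divided by the marginal $p(s' \mid s, a)$.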
5 votes, 1 answer

Should the policy parameters be updated at each time step or at the end of the episode in REINFORCE?

REINFORCE is a Monte Carlo policy gradient algorithm, which updates the weights (parameters) of the policy network by generating episodes. Here's the pseudo-code from Sutton's book (which is the same as the equation in Silver's RL notes): When I try to implement…
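A minimal NumPy sketch of the usual arrangement (the linear-softmax policy and function name below are assumptions for illustration, not the book's code): the episode is generated first, and the parameters are then updated once per time step of that finished episode, using the return from that step onward.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE pass over a finished episode.

    episode: list of (state_features, action, reward) tuples in time order.
    theta:   parameters of a linear-softmax policy, shape (n_actions, n_features).
    """
    T = len(episode)
    returns = np.zeros(T)
    G = 0.0
    # Compute the discounted returns G_t backwards through the episode.
    for t in reversed(range(T)):
        G = episode[t][2] + gamma * G
        returns[t] = G
    # One gradient step per time step, performed after the episode has ended.
    for t, (x, a, _) in enumerate(episode):
        probs = softmax(theta @ x)            # pi(. | s_t, theta)
        grad_log_pi = -np.outer(probs, x)     # d log pi(a|s) / d theta_b = (1[b=a] - pi(b|s)) x
        grad_log_pi[a] += x
        theta += alpha * (gamma ** t) * returns[t] * grad_log_pi
    return theta

The per-step updates here mirror the loop structure of the book's pseudo-code (so later steps in the same episode already see slightly updated parameters); accumulating all steps and applying one update at the end of the episode is the other common implementation choice the question contrasts.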