Questions tagged [sutton-barto]

For questions related to the book "Reinforcement Learning: An Introduction" (by Richard S. Sutton and Andrew G. Barto).

"Reinforcement Learning: An Introduction" (by Andrew Barto and Richard S. Sutton) is often considered or cited as the most comprehensive introductory manual to the field of RL, by two of the greatest contributors to the field.

Two editions have been published so far: the first in 1998 and the second in 2018. Material related to the book (including some drafts) is available at the following URL: http://incompleteideas.net/book/.

122 questions
24 votes, 2 answers

What is the difference between reinforcement learning and optimal control?

Coming from a process (optimal) control background, I have begun studying the field of deep reinforcement learning. Sutton & Barto (2015) state that particularly important (to the writing of the text) have been the contributions establishing and…
18 votes, 4 answers

Why does the discount rate in the REINFORCE algorithm appear twice?

I was reading the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (complete draft, November 5, 2017). On page 271, the pseudo-code for the episodic Monte-Carlo Policy-Gradient Method is presented. Looking at…
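For context, the update in the book's pseudo-code has the following shape (a sketch of the 2nd-edition pseudo-code; notation may differ slightly from the draft quoted in the question):

$$\theta \leftarrow \theta + \alpha \, \gamma^{t} \, G_t \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta), \qquad G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k,$$

so the discount appears once inside the return $G_t$ and again as the explicit $\gamma^t$ factor in front of the gradient.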
12 votes, 4 answers

Counterexamples to the reward hypothesis

In Sutton and Barto's RL book, the reward hypothesis is stated as follows: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called…
11 votes, 2 answers

How do we prove the n-step return error reduction property?

In section 7.1 (about n-step bootstrapping) of the book Reinforcement Learning: An Introduction (2nd edition), by Richard S. Sutton and Andrew G. Barto, the authors write about what they call the "n-step return error reduction property". But they…
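For reference, the property in question (equation 7.3 in the 2nd edition, if I recall the numbering correctly) states that

$$\max_s \Big| \mathbb{E}_\pi\big[G_{t:t+n} \mid S_t = s\big] - v_\pi(s) \Big| \;\le\; \gamma^{n} \max_s \Big| V_{t+n-1}(s) - v_\pi(s) \Big|,$$

i.e. the worst-case error of the expected $n$-step return is at most $\gamma^n$ times the worst-case error of the value estimate it bootstraps from.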
8 votes, 1 answer

How is the policy gradient calculated in REINFORCE?

Reading Sutton and Barto, I see the following in describing policy gradients: How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the…
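For orientation, the policy-gradient expression at the heart of REINFORCE, as given in the book, is

$$\nabla_{\theta} J(\theta) \propto \mathbb{E}_\pi\big[ G_t \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta) \big],$$

so the gradient is taken with respect to the policy parameters $\theta$ (not the action itself) and is evaluated at the action $A_t$ that was actually taken at time $t$, weighted by the return $G_t$.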
7 votes, 2 answers

How can the importance sampling ratio be different than zero when the target policy is deterministic?

In the book Reinforcement Learning: An Introduction (2nd edition) Sutton and Barto define at page 104 (p. 126 of the pdf), equation (5.3), the importance sampling ratio, $\rho _{t:T-1}$, as follows: $$\rho…
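The full definition being quoted (equation 5.3) is

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

where $\pi$ is the target policy and $b$ is the behaviour policy; with a deterministic $\pi$, each factor is either $0$ or $1/b(A_k \mid S_k)$, which is what the question turns on.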
7 votes, 2 answers

In Value Iteration, why can we initialize the value function arbitrarily?

I have not been able to find a good explanation of this, other than statements that the algorithm is guaranteed to converge with arbitrary choices for initial values in each state. Is this something to do with the Bellman optimality constraint…
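A minimal sketch of value iteration on a made-up 2-state MDP (the transition and reward arrays below are invented purely for illustration) shows the point empirically: the Bellman optimality operator is a $\gamma$-contraction, so very different initial value functions converge to the same fixed point.

import numpy as np

# Hypothetical MDP: 2 states, 2 actions (numbers invented for illustration).
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(V0, tol=1e-10):
    V = np.asarray(V0, dtype=float).copy()
    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
        Q = R + gamma * (P @ V)        # action values, shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Two very different initialisations reach the same optimal value function.
print(value_iteration(np.zeros(2)))
print(value_iteration(np.array([100.0, -50.0])))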
7 votes, 2 answers

In the n-step off-policy SARSA update, why do we multiply the entire update by $\rho$?

In Sutton & Barto's book (2nd ed), page 149, there is equation 7.11, which I am having a hard time understanding. I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
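For reference (without reproducing the exact subscript range on the importance-sampling ratio from memory), the update being discussed has the general shape

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, \rho \, \big[ G_{t:t+n} - Q(S_t, A_t) \big],$$

i.e. the ratio $\rho$ multiplies the entire TD error, including the $-Q(S_t, A_t)$ term, rather than only the return $G_{t:t+n}$, which is precisely what the question asks about.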
6 votes, 5 answers

How do we compute the table for $p(s',r|s,a)$ (exercise 3.5 in Sutton & Barto's book)?

I am trying to study the book Reinforcement Learning: An Introduction (Sutton & Barto, 2018). In chapter 3.1, the authors state the following exercise: Exercise 3.5 Give a table analogous to that in Example 3.3, but for $p(s',r|s,a)$. It should have…
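As a structural aid only (the states, actions, and numbers below are hypothetical and are not the ones from Example 3.3), such a table can be represented as a nested mapping keyed by $(s, a)$, where each row sums to one over the $(s', r)$ pairs:

# Hypothetical dynamics table: (s, a) -> {(s', r): probability}.
p = {
    ("s0", "a0"): {("s0", 1.0): 0.7, ("s1", 0.0): 0.3},
    ("s0", "a1"): {("s1", 2.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
}

# Sanity check implied by the exercise: each (s, a) row must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-12, (s, a)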
6 votes, 1 answer

If $\gamma \in (0,1)$, what is the on-policy state distribution for episodic tasks?

In Reinforcement Learning: An Introduction, section 9.2 (page 199), Sutton and Barto describe the on-policy distribution in episodic tasks, with $\gamma =1$, as being \begin{equation} \mu(s) = \frac{\eta(s)}{\sum_{k \in S}…
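For context, in the $\gamma = 1$ case the book defines $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$ with $\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) \, p(s \mid \bar{s}, a)$, where $h(s)$ is the probability that an episode starts in $s$. If I recall correctly, the book remarks that discounting can be treated as a form of probabilistic termination, which amounts to multiplying the second term by $\gamma$:

$$\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) \, p(s \mid \bar{s}, a).$$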
6 votes, 1 answer

If the current state is $S_t$ and the actions are chosen according to $\pi$, what is the expectation of $R_{t+1}$ in terms of $\pi$ and $p$?

I'm trying to solve exercise 3.11 from Sutton and Barto's book (2nd edition): Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms…
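One way to write such an expectation with the four-argument dynamics function (not necessarily the exact form the exercise expects) is

$$\mathbb{E}_\pi\big[R_{t+1} \mid S_t = s\big] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r r \, p(s', r \mid s, a).$$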
5 votes, 2 answers

Why is $\sum_{s} \eta(s)$ a constant of proportionality in the proof of the policy gradient theorem?

In Sutton and Barto's book (http://incompleteideas.net/book/bookdraft2017nov5.pdf), a proof of the policy gradient theorem is provided on pg. 269 for an episodic case and a start state policy objective function (see picture below, last 3…
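For context, the episodic policy gradient theorem in the book reads

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \, \nabla_{\theta} \pi(a \mid s, \theta),$$

with $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$, so the suppressed constant of proportionality is $\sum_s \eta(s)$ (the average episode length), which is what the question is asking about.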
5 votes, 1 answer

What is the difference between an on-policy distribution and state visitation frequency?

On-policy distribution is defined as follows in Sutton and Barto: On the other hand, state visitation frequency is defined as follows in Trust Region Policy Optimization: $$\rho_{\pi}(s) = \sum_{t=0}^{T} \gamma^t P(s_t=s|\pi)$$ Question: What is…
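Roughly speaking (this is only a sketch of the relationship, not a full answer), the two objects differ by normalisation: the TRPO quantity $\rho_\pi(s)$ is an un-normalised (discounted) expected visitation count, whereas the book's on-policy distribution divides the visitation measure $\eta(s)$ by its total,

$$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')},$$

so that it is a proper probability distribution over states.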
5 votes, 2 answers

Why does the definition of the reward function $r(s, a, s')$ involve the term $p(s' \mid s, a)$?

Sutton and Barto define the state–action–next-state reward function, $r(s, a, s')$, as follows (equation 3.6, p. 49) $$ r(s, a, s^{\prime}) \doteq \mathbb{E}\left[R_{t} \mid S_{t-1}=s, A_{t-1}=a, S_{t}=s^{\prime}\right]=\sum_{r \in \mathcal{R}} r…
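The full definition (equation 3.6 in the 2nd edition) reads

$$r(s, a, s') \doteq \mathbb{E}\big[R_t \mid S_{t-1}=s, A_{t-1}=a, S_t=s'\big] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)},$$

and the $p(s' \mid s, a)$ term arises because conditioning on $S_t = s'$ turns the joint dynamics $p(s', r \mid s, a)$ into a conditional probability, i.e. it must be divided by the marginal $p(s' \mid s, a)$.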
5 votes, 1 answer

Should the policy parameters be updated at each time step or at the end of the episode in REINFORCE?

REINFORCE is a Monte Carlo policy gradient algorithm, which updates the weights (parameters) of the policy network by generating episodes. Here's the pseudo-code from Sutton's book (which is the same as the equation in Silver's RL notes): When I try to implement…
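A minimal NumPy sketch of the usual arrangement (the linear-softmax policy and function name below are assumptions for illustration, not the book's code): the episode is generated first, and the parameters are then updated once per time step of that finished episode, using the return from that step onward.

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE pass over a finished episode.

    episode: list of (state_features, action, reward) tuples in time order.
    theta:   parameters of a linear-softmax policy, shape (n_actions, n_features).
    """
    T = len(episode)
    returns = np.zeros(T)
    G = 0.0
    # Compute the discounted returns G_t backwards through the episode.
    for t in reversed(range(T)):
        G = episode[t][2] + gamma * G
        returns[t] = G
    # One gradient step per time step, performed after the episode has ended.
    for t, (x, a, _) in enumerate(episode):
        probs = softmax(theta @ x)            # pi(. | s_t, theta)
        grad_log_pi = -np.outer(probs, x)     # d log pi(a|s) / d theta_b = (1[b=a] - pi(b|s)) x
        grad_log_pi[a] += x
        theta += alpha * (gamma ** t) * returns[t] * grad_log_pi
    return theta

The per-step updates here mirror the loop structure of the book's pseudo-code (so later steps in the same episode already see slightly updated parameters); accumulating all steps and applying one update at the end of the episode is the other common implementation choice the question contrasts.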