Questions tagged [td-lambda]
For questions related to the TD($\lambda$) family of algorithms.
13 questions
11
votes
3 answers
What is the intuition behind TD($\lambda$)?
I'd like to better understand temporal-difference learning. In particular, I'm wondering if it is prudent to think about TD($\lambda$) as a type of "truncated" Monte Carlo learning?
Nick Kunz
- 165
- 1
- 7
8
votes
2 answers
Why are lambda returns so rarely used in policy gradients?
I've seen the Monte Carlo return $G_{t}$ being used in REINFORCE and the TD($0$) target $r_t + \gamma Q(s', a')$ in vanilla actor-critic. However, I've never seen someone use the lambda return $G^{\lambda}_{t}$ in these situations, nor in any other…
jhinGhin
- 83
- 3
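For the question above, a minimal sketch of how a lambda return could be computed from a recorded rollout; the function name, the `rewards`/`values` arrays, and the hyperparameters are illustrative assumptions, not from the question:

```python
import numpy as np

def lambda_returns(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute G_t^lambda for each step of a rollout, working backwards.

    rewards:    r_0 ... r_{T-1} collected by the behaviour policy
    values:     critic estimates V(s_0) ... V(s_{T-1})
    last_value: V(s_T), used to bootstrap the final step
    """
    T = len(rewards)
    returns = np.zeros(T)
    next_return = last_value   # G_T^lambda is bootstrapped from V(s_T)
    next_value = last_value    # V(s_{t+1}) for the step currently processed
    for t in reversed(range(T)):
        # Recursive form: G_t^lambda = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}^lambda)
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns
```

With $\lambda = 0$ this reduces to the TD(0) target and with $\lambda = 1$ to the Monte Carlo return; subtracting $V(s_t)$ gives a GAE-style advantage, which is probably the closest thing to a lambda return in wide use in policy gradients.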
6
votes
1 answer
Can TD($\lambda$) be used with deep reinforcement learning?
TD($\lambda$) is a way to interpolate between TD(0), which bootstraps over a single step, and TD(max), which bootstraps over the entire episode length, i.e. Monte Carlo.
Reading the link above, I see that an eligibility trace is kept for each state in order…
Gulzar
- 789
- 1
- 10
- 27
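For the question above, a minimal tabular sketch of the mechanism described in the excerpt, with one accumulating eligibility trace per state; the environment interface, names, and hyperparameters are assumptions for illustration:

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda) with accumulating traces.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done),
    with integer states indexing into the value table V.
    """
    z = np.zeros_like(V)                  # eligibility trace: one entry per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
        z *= gamma * lam                  # decay every state's eligibility
        z[s] += 1.0                       # bump the state just visited
        V += alpha * delta * z            # all eligible states share the TD error
        s = s_next
    return V
```

The per-state trace is exactly what becomes awkward with deep function approximation: with a network the trace has to be kept per weight rather than per state, which is the crux of the question.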
5
votes
1 answer
Why not more TD($\lambda$) in actor-critic algorithms?
Is there either an empirical or theoretical reason that actor-critic algorithms with eligibility traces have not been more fully explored? I was hoping to find a paper or implementation or both for continuous tasks (not episodic) in continuous…
Nick Kunz
- 165
- 1
- 7
5
votes
2 answers
Why am I getting the incorrect value of lambda?
I am trying to solve for $\lambda$ using temporal-difference learning. More specifically, I am trying to figure out what $\lambda$ I need such that $\text{TD}(\lambda)=\text{TD}(1)$ after one iteration. But I get the incorrect value of…
Amanda
- 205
- 1
- 5
3
votes
1 answer
Derivation of Sutton & Barto TD(λ) Weight Update Equation with Eligibility Traces
I'm working through Sutton & Barto's Reinforcement Learning: An Introduction, 2nd edition, and trying to understand the derivation of Equation 12.7 for TD(λ) weight updates in Chapter 12, specifically when using eligibility traces. Here’s the update…
AJR
- 31
- 1
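For readers without the book at hand, the update being derived is, up to notation, the semi-gradient TD($\lambda$) rule of Chapter 12, with $\mathbf{z}_t$ the eligibility-trace vector and $\delta_t$ the TD error:

$$\begin{aligned}
\delta_t &\doteq R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t), \\
\mathbf{z}_t &\doteq \gamma\lambda\,\mathbf{z}_{t-1} + \nabla\hat{v}(S_t, \mathbf{w}_t), \qquad \mathbf{z}_{-1} = \mathbf{0}, \\
\mathbf{w}_{t+1} &\doteq \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{z}_t.
\end{aligned}$$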
3
votes
0 answers
How does bootstrapping work with the offline $\lambda$-return algorithm?
In Barto and Sutton's book, Reinforcement Learning: An Introduction (2nd edition), on page 289, equation (12.2) introduces the $\lambda$-return, defined as follows:
$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}…
quest ions
- 394
- 1
- 8
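For context, the definition the excerpt is quoting continues (in the book's notation, with $G_{t:t+n}$ the $n$-step return) as

$$G_t^{\lambda} \doteq (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n},$$

and for an episode terminating at time $T$ the tail terms collapse into the full return, giving $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t$.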
2
votes
0 answers
Why is TD(0) not converging to the optimal policy?
I am trying to implement the basic RL algorithms to learn on this 10x10 GridWorld (from REINFORCEjs by Karpathy).
Currently I am stuck at TD(0). No matter how many episodes I run, when I am updating the policy after all episodes are done according…
PeeteKeesel
- 121
- 3
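For reference, a minimal sketch of the tabular TD(0) backup being implemented; the value table `V` and the step size are assumptions:

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) backup for state-value prediction under a fixed policy."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```

Note that TD(0) in this form only evaluates the policy that generated the data; obtaining an optimal policy additionally requires an improvement step (or a control method such as SARSA or Q-learning).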
2
votes
1 answer
How is $\Delta$ updated in true online TD($\lambda$)?
In section 7.4 of the RL textbook by Sutton & Barto, the authors discuss "True Online TD($\lambda$)". The figure below (7.10 in the book) shows the algorithm.
At the end of each step, $V_{old} \leftarrow V(S')$ and also $S \leftarrow S'$. When…
roy
- 53
- 3
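For readers without the figure at hand, a sketch of the per-step body of True Online TD($\lambda$) with linear function approximation, roughly as it appears in the book; `x` and `x_next` are the feature vectors of $S$ and $S'$, and the function signature is an assumption:

```python
import numpy as np

def true_online_td_step(w, z, x, x_next, r, v_old, alpha=0.01, gamma=0.99, lam=0.9):
    """One step of True Online TD(lambda) with a linear value function v(s) = w . x(s)."""
    v = w @ x
    v_next = w @ x_next
    delta = r + gamma * v_next - v                                   # TD error
    z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x  # dutch-style trace
    w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
    v_old = v_next      # the "V_old <- V(S')" line the excerpt mentions
    return w, z, v_old
```

So $V_{old}$ is just the previous step's estimate of the current state's value, carried forward so that the $(V - V_{old})$ correction terms can be formed on the next step.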
1
vote
0 answers
When do you back-propagate errors through a neural network when using TD($\lambda$)?
I have a neural network that I want to use to play Connect Four via self-play. The neural network receives the board state and is to provide an estimate of the state's value.
I would then, for each move, use the highest estimate; occasionally I will use…
NeomerArcana
- 220
- 4
- 13
1
vote
0 answers
What is 'eligibility' in intuitive terms in TD($\lambda$) learning?
I am watching the lecture from Brown University (on Udemy), and I am at the portion on temporal-difference learning.
In the pseudocode/algorithm of TD(1) (seen in the screenshot below), we initialise the eligibility $e(s) = 0$ for all states. Later…
cgo
- 185
- 6
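Intuitively, the eligibility $e(s)$ is a decaying memory of how recently and how often each state was visited, so that a later TD error $\delta_t$ can be credited back to it; the usual accumulating-trace bookkeeping on each step is

$$e(s) \leftarrow \gamma\lambda\, e(s) \;\;\text{for all } s, \qquad e(S_t) \leftarrow e(S_t) + 1, \qquad V(s) \leftarrow V(s) + \alpha\,\delta_t\, e(s) \;\;\text{for all } s,$$

with $\lambda = 1$ corresponding to the TD(1) case.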
1
vote
0 answers
How is the general return-based off-policy equation derived?
I'm wondering how the general return-based off-policy equation in Safe and efficient off-policy reinforcement learning is derived:
$$\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t}…
fish_tree
- 247
- 2
- 6
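For context, as I recall, the operator in question (with $c_s$ the per-step trace coefficients and the expectation taken over trajectories generated by the behaviour policy $\mu$) has the form

$$\mathcal{R} Q(x, a) := Q(x, a) + \mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t})\right)\right],$$

where $\mathbb{E}_{\pi} Q(x, \cdot) = \sum_{a} \pi(a \mid x)\, Q(x, a)$; different choices of $c_s$ recover importance sampling, TB($\lambda$), and Retrace($\lambda$).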
0
votes
1 answer
When using TD(λ), how do you calculate the eligibility trace per input & weight of a neural network neuron?
I have a neural network; each neuron is made up of inputs, weights, and an output. I have potentially multiple hidden layers. The activation function applied to the output is not known by the neuron.
I would like to use TD(λ) to back-propagate…
NeomerArcana
- 220
- 4
- 13
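For the question above (and the related Connect Four question from the same author earlier in the list), one common way to realise this is to keep one trace tensor per weight tensor, decay it each step, and add the gradient of the current value estimate; the TD error then scales the whole trace when the weights are updated. A hedged sketch — the network shape, the use of PyTorch, and all hyperparameters are assumptions, not from the questions:

```python
import torch
import torch.nn as nn

# Hypothetical value network; 42 inputs could be a flattened 6x7 Connect Four board.
value_net = nn.Sequential(nn.Linear(42, 64), nn.Tanh(), nn.Linear(64, 1))
traces = [torch.zeros_like(p) for p in value_net.parameters()]   # one trace per weight tensor

alpha, gamma, lam = 1e-3, 0.99, 0.8

def td_lambda_step(state, reward, next_state, done):
    """One TD(lambda) update with per-parameter accumulating traces."""
    value_net.zero_grad()
    v = value_net(state).squeeze()
    v.backward()                       # p.grad now holds dV(s)/dw for every weight
    with torch.no_grad():
        v_next = 0.0 if done else value_net(next_state).item()
        delta = reward + gamma * v_next - v.item()         # TD error
        for p, z in zip(value_net.parameters(), traces):
            z.mul_(gamma * lam).add_(p.grad)               # z <- gamma*lam*z + dV(s)/dw
            p.add_(alpha * delta * z)                      # w <- w + alpha*delta*z
```

With $\lambda = 0$ the trace is just the current gradient and this collapses to ordinary one-step semi-gradient TD; larger $\lambda$ lets earlier positions share credit for errors discovered later in the game.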