I am working on a Q-learning system where the agent almost always has to take many actions before any reward is given (in fact, the notion of a reward in my context only becomes defined after many actions). How can I update Q-values when I only know the reward for a whole sequence of actions rather than for specific state-action pairs?
Doing my own research, I have found one suggested solution: say a reward is given after the state-action pairs (s1, a1), (s2, a2), and (s3, a3). Update the Q-value of (s3, a3) with the given reward, then update (s2, a2) using the new Q-value of (s3, a3), then update (s1, a1) using the new Q-value of (s2, a2). To me, this seems like a rather arbitrary approach. If it is correct, could someone please explain why, and if not, what is the correct approach? For concreteness, here is a rough sketch of how I understand that backward-update idea (the states, actions, learning rate, discount factor, and final reward are all placeholders I made up):
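
```python
# Rough sketch of the backward-update idea described above.
# States, actions, alpha, gamma, and the final reward R are placeholders.

alpha, gamma = 0.1, 0.9           # learning rate and discount factor
Q = {}                            # Q-table: (state, action) -> value

def q(s, a):
    return Q.get((s, a), 0.0)

def best_q(s, actions):
    return max(q(s, a) for a in actions)

episode = [("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
actions = ["a1", "a2", "a3"]      # assumed action set for every state
R = 1.0                           # reward observed only at the end of the episode

# Walk the trajectory backwards so each update can bootstrap from the
# freshly updated value of its successor state.
for i in reversed(range(len(episode))):
    s, a = episode[i]
    if i == len(episode) - 1:
        target = R                # terminal step: just the delayed reward
    else:
        s_next = episode[i + 1][0]
        target = 0.0 + gamma * best_q(s_next, actions)  # intermediate reward is 0
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

print(Q)
```

Is this sweep-backwards-through-the-episode procedure actually justified, or is there a more principled way to handle rewards that only arrive at the end of a long sequence of actions?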