I am working on a Q-learning system where the agent almost always has to take many actions before any reward is given (in fact, the notion of a reward in my context only becomes defined after many actions). How can I update Q-values when I only know the reward for a whole sequence of actions rather than for specific state-action pairs?
Doing my own research, I have found one suggested solution: say a reward is given after the state-action pairs (s1, a1), (s2, a2), and (s3, a3). Update the Q-value of (s3, a3) with the given reward, then update (s2, a2) using the new Q-value of (s3, a3), then update (s1, a1) using the new Q-value of (s2, a2). To me, this seems like a rather arbitrary approach. If it is correct, could someone please explain why, and if not, what is the correct approach? For concreteness, here is a rough sketch of how I understand that backward-update idea (the states, actions, learning rate, discount factor, and final reward are all placeholders I made up):
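
```python
# Rough sketch of the backward-update idea described above.
# States, actions, alpha, gamma, and the final reward R are placeholders.

alpha, gamma = 0.1, 0.9           # learning rate and discount factor
Q = {}                            # Q-table: (state, action) -> value

def q(s, a):
    return Q.get((s, a), 0.0)

def best_q(s, actions):
    return max(q(s, a) for a in actions)

episode = [("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
actions = ["a1", "a2", "a3"]      # assumed action set for every state
R = 1.0                           # reward observed only at the end of the episode

# Walk the trajectory backwards so each update can bootstrap from the
# freshly updated value of its successor state.
for i in reversed(range(len(episode))):
    s, a = episode[i]
    if i == len(episode) - 1:
        target = R                # terminal step: just the delayed reward
    else:
        s_next = episode[i + 1][0]
        target = 0.0 + gamma * best_q(s_next, actions)  # intermediate reward is 0
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

print(Q)
```

Is this sweep-backwards-through-the-episode procedure actually justified, or is there a more principled way to handle rewards that only arrive at the end of a long sequence of actions?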