
Note: I will use the abbreviation FV for first-visit and EV for every-visit.

I am reading the famous Sutton & Barto book Reinforcement Learning: An Introduction (second edition), and after reading the following exercise, I stumbled upon a question I can't seem to find a satisfactory answer to:

Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probability $1 - p$. Let the reward be $+1$ on all transitions, and let $\gamma = 1$ (the discount rate). Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?

Now, I know the intended answer to this should be $V(s) = 10$ for FV and $V(s) = 5.5$ for EV.
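
For concreteness, here is a minimal forward-order sketch (just my illustration of the expected numbers, not my actual implementation) for this exact episode:

episode = [('s', 1)] * 10   # 10 transitions from s, reward +1 each, then termination
gamma = 1

# Return G_t from each time step t, computed forwards for clarity.
returns = [sum(gamma ** (k - t) * episode[k][1] for k in range(t, len(episode)))
           for t in range(len(episode))]

first_visit_estimate = returns[0]                    # return from the first visit only
every_visit_estimate = sum(returns) / len(returns)   # average over all 10 visits

print(first_visit_estimate)   # 10
print(every_visit_estimate)   # 5.5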

But my implementation finds $V(s) = 1$ for FV, because it treats the first visit to the state as the first one encountered by the backwards iteration of the algorithm (i.e. the last occurrence in the episode), not the first occurrence in the actual order of the given episode.

I assume the correct implementation should consider the natural order in which the state first occurs in the episode, even if we are iterating backwards for efficiency reasons (though I can't defend this assumption). I couldn't see this from the pseudocode; it is quite ambiguous, since, if we trace it literally (below, taken from the book), it seems like $V(s)$ should be $1$ in this example.

[Pseudocode for first-visit MC, taken from the book]

In short, my question is:

  1. When we are doing first-visit MC, does "first visit" refer to the first occurrence of the state in the natural order of the episode, regardless of implementation details (i.e. forward/backward iteration)?

  2. If so, why prefer one definition over the other? What is the intuition behind this?

If I am not mistaken, I haven't seen this addressed in the book, nor anywhere else I could find, so I would be very happy if someone could give me some insight into this.

Thank you.

Silidrone

2 Answers


I assume the correct implementation should consider the natural order in which the state first occurs in the episode, even if we are iterating backwards for efficiency reasons (though I can't defend this assumption). I couldn't see this from the pseudocode, it is quite ambiguous

(It's been a long time since I studied these algorithms, but let's see if I can help you.)

That pseudo-code tells you to execute the update unless the pair $S_t, A_t$ appears in the sequence $S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}$. We still iterate from $T-1$ down to $0$, but does that really matter? We just need to follow the symbols and their meanings.

Now, the problem here is interpreting that "unless" (a horrible word, because it is sometimes used to mean the opposite).

Does it mean that the code is executed

  1. if $S_t, A_t$ appears in the sequence from $0$ to $t-1$, or
  2. if $S_t, A_t$ does not appear in that sequence?

If we execute that code block only when the pair is not in that sequence (option 2), that means $S_t, A_t$ is being encountered for the first time in the natural order, i.e. the order in which we visited the states during the episode.

If we execute the code when the pair is in that sequence (option 1), then we might execute that code more than once (exactly the number of times we encounter that pair minus 1), so that doesn't look like first-visit.
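
To make the two readings concrete, here is a small sketch (just my illustration for the episode of exercise 5.5, with $\gamma = 1$) of interpretation 2; it executes the update exactly once, at the natural first visit:

pairs = [('s', 'a')] * 10               # the episode of exercise 5.5
returns = [10 - t for t in range(10)]   # G_t for that episode, gamma = 1

updates = []
for t in range(len(pairs) - 1, -1, -1):   # backward loop, as in the pseudocode
    if pairs[t] not in pairs[:t]:         # interpretation 2 of "unless"
        updates.append((t, returns[t]))

print(updates)   # [(0, 10)]: exactly one update, at the natural first visit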

Maybe my answer here can also be useful, although the pseudocode you're showing is for control.

nbro

The backward iteration in the algorithm above only simplifies the computation of the return $G$ from any step $t$: you just add the reward experienced at the current step to the previously computed $G$, working backwards through the episode, which eliminates the need to recompute $G$ from scratch up to the end of the episode for every first-visited state in a first-visit MC implementation.
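
In symbols, the backward pass uses the standard recursion

$$G_t = R_{t+1} + \gamma\, G_{t+1}, \qquad G_T = 0,$$

so the return at each step is obtained by reusing the return already computed for the step after it, instead of summing rewards to the end of the episode again.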

However, your exercise 5.5 clearly suggests that this backward iteration should not change which occurrence counts as the first visit of a state in the natural order of the episode, and this obviously affects the state's value estimate via the accumulation of $G$ that happens at every step in your referenced pseudocode. Therefore, you need to focus on the condition in the pseudocode that identifies the first-visited state in natural order within the backward iteration.

Take the example of exercise 5.5: when $t = T-1 = 9$, the state-action pair $(s_9, a_9) = (s, a)$ does appear in the preceding sequence $(s_0, a_0, s_1, a_1, \dots, s_8, a_8) = (s, a, s, a, \dots, s, a)$, since there is only a single nonterminal state $s$ and a single action $a$ in this exercise. So only at step $t = 0$ is the first-visit condition really satisfied, and only then do you enter the final block to update the state value or action value, depending on your goal. The Python code below shows how to implement the critical first-visit condition and arrive at the correct value $V(s) = 10$ for the single episode described in exercise 5.5.

episode = [('s', 1) for _ in range(10)]
state_sequence = [item[0] for item in episode]
G = 0
gamma = 1
first_visited_states = set()
value_estimates = {}

for t in range(len(episode) - 1, -1, -1):
    state, reward = episode[t]
    G = gamma * G + reward                 # incremental (backward) return computation
    state_sequence = state_sequence[0:t]   # states visited before step t (natural order)
    if state not in state_sequence:        # first-visit check in natural order
        first_visited_states.add(state)
        value_estimates[state] = G

print("V(s)=", value_estimates['s'])

The result is as expected:

V(s)= 10
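
For comparison, a minimal every-visit variant of the code above (my own sketch, not from the book) averages the return over all visits of the state and reproduces the $V(s) = 5.5$ estimate for the same episode:

episode = [('s', 1) for _ in range(10)]
gamma = 1
G = 0
returns_per_state = {}                     # state -> returns observed at every visit
for t in range(len(episode) - 1, -1, -1):
    state, reward = episode[t]
    G = gamma * G + reward
    returns_per_state.setdefault(state, []).append(G)   # every visit contributes

value_estimates = {s: sum(gs) / len(gs) for s, gs in returns_per_state.items()}
print("V(s)=", value_estimates['s'])       # prints V(s)= 5.5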

Finally, the returns collected by the every-visit strategy are dependent, so they are not i.i.d. samples, unlike those collected by the first-visit strategy. Nevertheless, if two visits are far apart from each other, the dependence is not strong.

Both first-visit MC and every-visit MC converge to $v_π(s)$ as the number of visits (or first visits) to $s$ goes to infinity. This is easy to see for the case of first-visit MC. In this case each return is an independent, identically distributed estimate of $v_π(s)$ with finite variance. By the law of large numbers the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as $1/ \sqrt n$, where $n$ is the number of returns averaged. Every-visit MC is less straightforward, but its estimates also converge quadratically to $v_π(s)$ (Singh and Sutton, 1996).

cinch