In explanation of Example 8.1, Sutton and Barto's book says

Without planning $(n = 0)$, each episode adds only one additional step to the policy, and so only one step (the last) has been learned so far. With planning, again only one step is learned during the first episode, but here during the second episode an extensive policy has been developed that by the end of the episode will reach almost back to the start state.

My question is - why is only one step learned in the first episode with planning?

I agree that nothing is learned in either setting until the agent reaches the goal. But on the last "real step", when the agent does reach the goal, the planning steps could also learn something if they happen to sample a state-action pair adjacent to the square whose value was just updated.
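For concreteness, the backup applied in Dyna-Q, both to real transitions and to transitions sampled from the model during planning, is the one-step tabular Q-learning update

$$Q(S, A) \leftarrow Q(S, A) + \alpha \Big[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \Big],$$

so if a planning step samples a pair $(S, A)$ whose successor $S'$ is the square whose value was just raised, $\max_a Q(S', a)$ is nonzero and that planning backup can move $Q(S, A)$ as well.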

1 Answer


You're correct that planning could, in principle, update the values of pairs adjacent to the goal even in the first episode. The state-action pairs immediately adjacent to the goal state (the only state with positive reward) might indeed receive some updates during planning. Beyond that, though, value cannot propagate reliably to earlier states: in the first episode the agent has no significant experience of earlier state-action pairs leading to the goal, so the planning samples drawn from its model have little to chain on.

In the second episode, the difference is not in the real experience, which is still bottlenecked at the single rewarded transition into the goal, but in the planning agent's growing ability to chain simulated value updates backward from the accumulated path experience. Each episode improves the model of the state-action pairs that lead to the goal, which lets planning assign credit further and further back toward the start state.
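To make the mechanism concrete, here is a minimal tabular Dyna-Q sketch on a hypothetical toy corridor (a stand-in for the maze of Example 8.1, not the book's exact grid; the constants and the env_step helper are my own choices). Until the transition into the goal is stored in the model, every sampled planning backup has zero reward and a zero-valued successor, so planning cannot change any Q-value during the first episode; only after the final real step can value begin to spread.

```python
import random
from collections import defaultdict

# Toy corridor as a stand-in for the maze of Example 8.1 (not the book's grid):
# states 0..N_STATES-1, goal at the right end, reward +1 only on the
# transition into the goal, 0 everywhere else.
ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 5
N_STATES, GOAL = 8, 7
ACTIONS = (0, 1)  # 0 = left, 1 = right

Q = defaultdict(float)   # tabular action values, all start at 0
model = {}               # model[(s, a)] = (r, s'), learned from real experience

def env_step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return (1.0 if s2 == GOAL else 0.0), s2

def backup(s, a, r, s2):
    # One-step Q-learning backup, used for both real and simulated transitions.
    target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# First episode of Dyna-Q. With Q all zero, an epsilon-greedy policy is
# effectively random, so a random behaviour policy is used here for brevity.
s = 0
while s != GOAL:
    a = random.choice(ACTIONS)
    r, s2 = env_step(s, a)
    backup(s, a, r, s2)        # direct RL update
    model[(s, a)] = (r, s2)    # model learning
    for _ in range(N_PLANNING):
        # Planning: replay a uniformly sampled, previously experienced pair.
        # Before the goal transition enters the model, every sampled backup
        # has r = 0 and max_a Q(s', a) = 0, so Q stays unchanged.
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps2)
    s = s2

print(sorted((k, round(v, 3)) for k, v in Q.items() if v > 0))
```

Running this, typically only the pair that steps into the goal ends up with a nonzero value after the first episode, occasionally together with one immediate predecessor that a planning sample happened to hit after the goal update; the deeper backward chaining only shows up from the second episode on, once the goal value is already in the table when planning replays earlier pairs.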
