In the explanation of Example 8.1, Sutton and Barto's book says:
> Without planning $(n = 0)$, each episode adds only one additional step to the policy, and so only one step (the last) has been learned so far. With planning, again only one step is learned during the first episode, but here during the second episode an extensive policy has been developed that by the end of the episode will reach almost back to the start state.
My question is: why is only one step learned during the first episode with planning?
I agree that in both settings nothing is learned until the agent reaches the goal. But on the last "real" step of the first episode, once the agent reaches the goal and updates the goal-adjacent square, the subsequent planning steps could also learn something if they happen to sample the state-action pair that leads into that just-updated square.
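For concreteness, here is a minimal sketch (my own, not the book's code) of the tabular Dyna-Q planning phase I have in mind, assuming a deterministic environment; the names `Q`, `model`, and `planning_steps` are just illustrative:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)] -> estimated action value
model = {}               # model[(s, a)] -> (reward, next_state), deterministic world

def planning_steps(n, actions, alpha=0.1, gamma=0.95):
    """Perform n simulated (planning) backups using the learned model."""
    for _ in range(n):
        if not model:
            return  # nothing observed yet, so nothing to plan with
        # Sample a previously observed state-action pair, as in tabular Dyna-Q.
        s, a = random.choice(list(model.keys()))
        r, s_next = model[(s, a)]
        # One-step Q-learning backup on the simulated transition.
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

With this loop, a planning step run during the final time step of episode 1 that happened to sample the state-action pair leading into the just-updated square would back up a nonzero value, so I would expect a second step could already be learned in the first episode.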