
Temporal difference algorithms (TD($\lambda$)) are tabular solution methods for reinforcement learning problems. That is, they maintain a table of all the states in the problem and estimate the expected long-term reward that can be obtained from each state.

The policy used in these situations is: "in your current state, choose to transition to the state that maximises the long-term reward". I observe that the "long-term reward" of a terminal state is precisely 0, by definition (see here).

So, if the greedy policy is to "choose the state which maximises the long-term reward", and the "long-term reward" of a terminal state is 0, doesn't that mean that the policy should never choose it, even if it has an immediate reward of 1 million points?


A minimal, practical example of my question is a 3x1 gridworld. The agent starts in state 0, and the goal/positive reward is in state 2.

The update rule for TD(0) is:

$$ V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] $$

Where $R_{t+1}$ is the immediate reward for transitioning to $S_{t+1}$.

Running this in Python, I get something like $V=[0.784, 0.895, 0]$. According to my understanding of what $V(S)$ should be, this is roughly the correct result. However, if I now apply the greedy policy ($\epsilon=0$), the agent swaps between state 0 and state 1 forever and never opts to go to state 2, because the long-term expected reward of that state is, by definition, 0.

Clearly there is something very fundamental that I am not understanding or missing, but I don't know what it is, or how to find it.

import random

def random_policy(V, S):
    # At the edges - immediate return
    if S == 0:
        return S+1
    if S == len(V)-1:
        return S-1
    # In the middle - choose randomly
    if random.random() > 0.5:
        return S+1
    else:
        return S-1

def greedy_policy(V, S):
    # At the edges - immediate return
    if S == 0:
        return S+1
    if S == len(V)-1:
        return S-1
    # In the middle - choose the highest estimated V(S_{t+1})
    a = V[S-1]
    b = V[S+1]
    if a < b:
        return S+1
    else:
        return S-1

gridworld = [0, 0, 1]
V = [0, 0, 0]
ALPHA = 0.1
GAMMA = 0.9

for x in range(1000):  # episodes
    state = 0
    while state != 2:
        next_state = random_policy(V, state)
        reward = gridworld[next_state]
        V[state] += ALPHA * (reward + GAMMA * V[next_state] - V[state])
        state = next_state

print(V)

Multihunter

1 Answer


Your greedy policy function is wrong.

It should be:

def greedy_policy(V, S):
   # At the edges - immediate return
   if S == 0:
      return S+1
   if S == len(V)-1:
      # Technically you should never reach here, because the episode has terminated
      return S-1
   # In the middle - choose the highest estimated future return
   a = gridworld[S-1] + GAMMA * V[S-1]
   b = gridworld[S+1] + GAMMA * V[S+1]
   if a < b:
      return S+1
   else:
      return S-1
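
As a quick check (my addition, reusing the greedy_policy above and the value estimates reported in the question), this corrected policy always steps towards the goal:

V = [0.784, 0.895, 0]
gridworld = [0, 0, 1]
GAMMA = 0.9
print(greedy_policy(V, 0), greedy_policy(V, 1))   # prints "1 2": both states move right, towards the goal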

This makes sense: your state value function assumes you will follow this rule. If the greedy policy that you were evaluating really did step backwards and forwards between states 0 and 1, then the values of both those states would be $0$, because a reward is never realised.
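
You can see that decay directly with a small, self-contained sketch (my own, using the same constants as the question): run TD(0) evaluation on the bouncing policy, starting from the values the question reported, and both estimates shrink towards zero because no reward is ever collected.

# TD(0) evaluation of the policy that bounces between states 0 and 1 forever
ALPHA = 0.1
GAMMA = 0.9
rewards = [0, 0, 1]        # reward for entering each state
V = [0.784, 0.895, 0.0]    # start from the values reported in the question

state = 0
for step in range(10000):
    next_state = 1 - state                 # bounce: 0 -> 1 -> 0 -> ...
    r = rewards[next_state]                # always 0 on this loop
    V[state] += ALPHA * (r + GAMMA * V[next_state] - V[state])
    state = next_state

print(V)  # V[0] and V[1] end up very close to 0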

In general, when working from a state value table $v(s)$, the greedy policy with respect to those values is given by:

$$\pi(s) = \text{argmax}_a \sum_{r,s'} p(r, s'|s,a)(r + \gamma v(s'))$$

Note this does not have to be the same policy as the one used to calculate $v(s)$, but in the special case of the optimal value function (and therefore an optimal policy) it is.
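
As a sketch of that formula in code (my own illustration; the transitions dictionary below is a hypothetical model, not something from the question), the greedy choice is a one-step lookahead over the model:

def greedy_action(V, transitions, s, actions, gamma=0.9):
    # transitions[(s, a)] is assumed to be a list of
    # (probability, reward, next_state) tuples describing p(r, s'|s, a)
    best_a, best_value = None, float("-inf")
    for a in actions:
        # Expected one-step return of action a: sum of p * (r + gamma * V(s'))
        value = sum(p * (r + gamma * V[s2]) for p, r, s2 in transitions[(s, a)])
        if value > best_value:
            best_a, best_value = a, value
    return best_a

For the deterministic gridworld in the question, transitions[(1, +1)] would just be [(1.0, 1, 2)], and this reduces to the corrected greedy_policy above.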

If you want to avoid using the MDP model for transition and reward probabilities $p(r, s'|s,a)$, then you will need to work with action value estimates and use a method like SARSA($\lambda$) or Q($\lambda$). If you use an action value table $q(s,a)$ then the greedy policy is much simpler:

$$\pi(s) = \text{argmax}_a q(s,a)$$

In that case, you can select the maximising action directly from your action value table.
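
For example, here is a one-step tabular Q-learning sketch (my own, and a simpler relative of the Q($\lambda$) mentioned above) on the same 3x1 gridworld, with actions $-1$ and $+1$; unlike the question's code, it lets the agent bump into the left wall and stay put:

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = [-1, +1]
rewards = [0, 0, 1]                          # reward for entering each state
Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def epsilon_greedy(s):
    # Behaviour policy: mostly greedy over Q, with some exploration
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(1000):
    s = 0
    while s != 2:
        a = epsilon_greedy(s)
        s2 = min(max(s + a, 0), 2)           # bump into the wall at the left edge
        r = rewards[s2]
        target = 0.0 if s2 == 2 else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * target - Q[(s, a)])
        s = s2

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
print(greedy)  # expect +1 (move right) in states 0 and 1

The greedy policy here is just $\text{argmax}_a q(s,a)$ per state, with no need for the reward table or transition model at decision time.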

Neil Slater