
Temporal difference algorithms (TD($\lambda$)) are tabular solution methods for reinforcement learning problems. That is, they maintain a table of all the states in the problem and estimate the expected long-term reward that can be obtained from each state.

The policy used in these situations is: "in your current state, choose to transition to the state that maximises the long-term reward". I observe that the "long-term reward" of a terminal state is precisely 0, by definition (see here).

So, if the greedy policy is to "choose the state which maximises the long-term reward", and the "long-term reward" of a terminal state is 0, doesn't that mean that the policy should never choose it, even if it has an immediate reward of 1 million points?


A minimal, practical example of my question is a 3x1 gridworld. The agent starts in state 0, and the goal/positive reward is in state 2.

The update rule for TD(0) is:

$$ V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] $$

Where $R_{t+1}$ is the immediate reward for transitioning to $S_{t+1}$.

Running this in Python, I get something like $V=[0.784, 0.895, 0]$. According to my understanding of what $V(S)$ should be, this is roughly the correct result. However, if I now apply the greedy policy ($\epsilon=0$), the agent swaps between state 0 and state 1 forever and never opts to go to state 2, because the long-term expected reward of that state is, by definition, 0.

Clearly there is something very fundamental that I am not understanding or missing, but I don't know what it is, or how to find it.

import random

def random_policy(V, S):
    # At the edges - immediate return
    if S == 0:
        return S+1
    if S == len(V)-1:
        return S-1
    # In the middle - choose randomly
    if random.random() > 0.5:
        return S+1
    else:
        return S-1

def greedy_policy(V, S):
    # At the edges - immediate return
    if S == 0:
        return S+1
    if S == len(V)-1:
        return S-1
    # In the middle - choose the highest estimated V(S_{t+1})
    a = V[S-1]
    b = V[S+1]
    if a < b:
        return S+1
    else:
        return S-1

gridworld = [0, 0, 1]
V = [0, 0, 0]
ALPHA = 0.1
GAMMA = 0.9

for x in range(1000):  # episodes
    state = 0
    while state != 2:
        next_state = random_policy(V, state)
        reward = gridworld[next_state]
        V[state] += ALPHA * (reward + GAMMA * V[next_state] - V[state])
        state = next_state

print(V)

Multihunter

1 Answer


Your greedy policy function is wrong.

It should be:

def greedy_policy(V, S):
   # At the edges - immediate return
   if S == 0:
      return S+1
   if S == len(V)-1:
      # Technically you should never reach here, because the episode has terminated
      return S-1
   # In the middle - choose the highest estimated future return
   a = gridworld[S-1] + GAMMA * V[S-1]
   b = gridworld[S+1] + GAMMA * V[S+1]
   if a < b:
      return S+1
   else:
      return S-1
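
As a quick check (my addition, reusing the greedy_policy above and the value estimates reported in the question), this corrected policy always steps towards the goal:

V = [0.784, 0.895, 0]
gridworld = [0, 0, 1]
GAMMA = 0.9
print(greedy_policy(V, 0), greedy_policy(V, 1))   # prints "1 2": both states move right, towards the goal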

This makes sense: your state value function assumes you will follow this rule. If the greedy policy that you were evaluating really did step backwards and forwards between states 0 and 1, then the values of both those states would be $0$, because a reward is never realised.
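
You can see that decay directly with a small, self-contained sketch (my own, using the same constants as the question): run TD(0) evaluation on the bouncing policy, starting from the values the question reported, and both estimates shrink towards zero because no reward is ever collected.

# TD(0) evaluation of the policy that bounces between states 0 and 1 forever
ALPHA = 0.1
GAMMA = 0.9
rewards = [0, 0, 1]        # reward for entering each state
V = [0.784, 0.895, 0.0]    # start from the values reported in the question

state = 0
for step in range(10000):
    next_state = 1 - state                 # bounce: 0 -> 1 -> 0 -> ...
    r = rewards[next_state]                # always 0 on this loop
    V[state] += ALPHA * (r + GAMMA * V[next_state] - V[state])
    state = next_state

print(V)  # V[0] and V[1] end up very close to 0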

In general, when working from a state value table $v(s)$, the greedy policy with respect to those values is given by:

$$\pi(s) = \text{argmax}_a \sum_{r,s'} p(r, s'|s,a)(r + \gamma v(s'))$$

Note this does not have to be the same policy as the one used to calculate $v(s)$, but in the special case of the optimal value function (and therefore an optimal policy) it is.
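
As a sketch of that formula in code (my own illustration; the transitions dictionary below is a hypothetical model, not something from the question), the greedy choice is a one-step lookahead over the model:

def greedy_action(V, transitions, s, actions, gamma=0.9):
    # transitions[(s, a)] is assumed to be a list of
    # (probability, reward, next_state) tuples describing p(r, s'|s, a)
    best_a, best_value = None, float("-inf")
    for a in actions:
        # Expected one-step return of action a: sum of p * (r + gamma * V(s'))
        value = sum(p * (r + gamma * V[s2]) for p, r, s2 in transitions[(s, a)])
        if value > best_value:
            best_a, best_value = a, value
    return best_a

For the deterministic gridworld in the question, transitions[(1, +1)] would just be [(1.0, 1, 2)], and this reduces to the corrected greedy_policy above.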

If you want to avoid using the MDP model for transition and reward probabilities $p(r, s'|s,a)$, then you will need to work with action value estimates and use a method like SARSA($\lambda$) or Q($\lambda$). If you use an action value table $q(s,a)$ then the greedy policy is much simpler:

$$\pi(s) = \text{argmax}_a q(s,a)$$

In that case, you can select the maximising action directly from your action value table.
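
For example, here is a one-step tabular Q-learning sketch (my own, and a simpler relative of the Q($\lambda$) mentioned above) on the same 3x1 gridworld, with actions $-1$ and $+1$; unlike the question's code, it lets the agent bump into the left wall and stay put:

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = [-1, +1]
rewards = [0, 0, 1]                          # reward for entering each state
Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def epsilon_greedy(s):
    # Behaviour policy: mostly greedy over Q, with some exploration
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(1000):
    s = 0
    while s != 2:
        a = epsilon_greedy(s)
        s2 = min(max(s + a, 0), 2)           # bump into the wall at the left edge
        r = rewards[s2]
        target = 0.0 if s2 == 2 else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * target - Q[(s, a)])
        s = s2

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
print(greedy)  # expect +1 (move right) in states 0 and 1

The greedy policy here is just $\text{argmax}_a q(s,a)$ per state, with no need for the reward table or transition model at decision time.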

Neil Slater