
I am trying to solve for $\lambda$ in temporal-difference learning. More specifically, I am trying to figure out which $\lambda$ I need so that $\text{TD}(\lambda)=\text{TD}(1)$ after one iteration, but I get an incorrect value of $\lambda$.

Here's my implementation.

from scipy.optimize import fsolve, leastsq
import numpy as np


class TD_lambda:
    def __init__(self, probToState, valueEstimates, rewards):
        self.probToState = probToState
        self.valueEstimates = valueEstimates
        self.rewards = rewards
        # TD(1) estimate of state 0, used below as the target when solving for lambda
        self.td1 = self.get_vs0(1)

    def get_vs0(self, lambda_):
        # One TD(lambda) update of the value of state 0 (the estimates are not persisted)
        probToState = self.probToState
        valueEstimates = self.valueEstimates
        rewards = self.rewards
        vs = dict(zip(['vs0', 'vs1', 'vs2', 'vs3', 'vs4', 'vs5', 'vs6'], list(valueEstimates)))

        # Lambda-return updates of the downstream states (computed here but not used for vs0 below)
        vs5 = vs['vs5'] + 1*(rewards[6] + 1*vs['vs6'] - vs['vs5'])
        vs4 = vs['vs4'] + 1*(rewards[5] + lambda_*rewards[6] + lambda_*vs['vs6'] + (1-lambda_)*vs['vs5'] - vs['vs4'])
        vs3 = vs['vs3'] + 1*(rewards[4] + lambda_*rewards[5] + lambda_**2*rewards[6] + lambda_**2*vs['vs6'] +
                             lambda_*(1-lambda_)*vs['vs5'] + (1-lambda_)*vs['vs4'] - vs['vs3'])
        vs1 = vs['vs1'] + 1*(rewards[2] + lambda_*rewards[4] + lambda_**2*rewards[5] + lambda_**3*rewards[6] +
                             lambda_**3*vs['vs6'] + lambda_**2*(1-lambda_)*vs['vs5'] + lambda_*(1-lambda_)*vs['vs4'] +
                             (1-lambda_)*vs['vs3'] - vs['vs1'])
        vs2 = vs['vs2'] + 1*(rewards[3] + lambda_*rewards[4] + lambda_**2*rewards[5] + lambda_**3*rewards[6] +
                             lambda_**3*vs['vs6'] + lambda_**2*(1-lambda_)*vs['vs5'] + lambda_*(1-lambda_)*vs['vs4'] +
                             (1-lambda_)*vs['vs3'] - vs['vs2'])

        # Lambda-return update of state 0, weighting the two successor branches by probToState
        vs0 = vs['vs0'] + probToState*(rewards[0] + lambda_*rewards[2] + lambda_**2*rewards[4] + lambda_**3*rewards[5] +
                                       lambda_**4*rewards[6] + lambda_**4*vs['vs6'] + lambda_**3*(1-lambda_)*vs['vs5'] +
                                       lambda_**2*(1-lambda_)*vs['vs4'] + lambda_*(1-lambda_)*vs['vs3'] +
                                       (1-lambda_)*vs['vs1'] - vs['vs0']) + \
              (1-probToState)*(rewards[1] + lambda_*rewards[3] + lambda_**2*rewards[4] + lambda_**3*rewards[5] +
                               lambda_**4*rewards[6] + lambda_**4*vs['vs6'] + lambda_**3*(1-lambda_)*vs['vs5'] +
                               lambda_**2*(1-lambda_)*vs['vs4'] + lambda_*(1-lambda_)*vs['vs3'] +
                               (1-lambda_)*vs['vs2'] - vs['vs0'])
        return vs0

    def get_lambda(self, x0=np.linspace(0.1, 1, 10)):
        # Find roots of TD(lambda) - TD(1) starting from several initial guesses
        return fsolve(lambda lambda_: self.get_vs0(lambda_) - self.td1, x0)

The expected output is $0.20550275877409016$, but I am getting array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]). I cannot understand what I am doing incorrectly. Here is how I call it:

TD = TD_lambda(probToState,valueEstimates,rewards)
TD.get_lambda()
# Output : array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

I am just using TD($\lambda$) for state 0 after one iteration. I am not required to see where it converges, so I don't update the value estimates.


2 Answers


The $TD(\lambda)$ return has the following form:
\begin{equation} G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n} \end{equation}
For your MDP, the $TD(1)$ return looks like this:
\begin{align} G &= 0.64 (r_0 + r_2 + r_4 + r_5 + r_6) + 0.36(r_1 + r_3 + r_4 + r_5 + r_6)\\ G &\approx 6.164 \end{align}
The $TD(\lambda)$ return looks like this:
\begin{equation} G_0^\lambda = (1-\lambda)[\lambda^0 G_{0:1} + \lambda^1 G_{0:2} + \lambda^2 G_{0:3} + \lambda^3 G_{0:4} + \lambda^4 G_{0:5} ] \end{equation}
Now each $G$ term separately:
\begin{align} G_{0:1} &= 0.64(r_0 + v_1) + 0.36(r_1 + v_2) \approx 7.864\\ G_{0:2} &= 0.64(r_0 + r_2 + v_3) + 0.36(r_1 + r_3 + v_3) \approx -5.336\\ G_{0:3} &= 0.64(r_0 + r_2 + r_4 + v_4) + 0.36(r_1 + r_3 + r_4 + v_4) \approx 25.864\\ G_{0:4} &= 0.64(r_0 + r_2 + r_4 + r_5 + v_5) + 0.36(r_1 + r_3 + r_4 + r_5 + v_5) \approx -11.936\\ G_{0:5} &= 0.64(r_0 + r_2 + r_4 + r_5 + r_6 + v_6) + 0.36(r_1 + r_3 + r_4 + r_5 + r_6 + v_6) \approx -0.336 \end{align}
Finally, we need to find the $\lambda$ for which this return equals the $TD(1)$ return:
\begin{equation} 6.164 = (1 - \lambda)[7.864 - 5.336\lambda + 25.864\lambda^2 - 11.936\lambda^3 - 0.336\lambda^4] \end{equation}
When you solve this equation, one of the solutions is $0.205029$, which is close to the value you needed to get, up to numerical error. Your problem is that you only account for the branching probability at the first state, but that decision also propagates to the future states.
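For concreteness, here is a minimal sketch of how one might solve that last equation numerically with scipy.optimize.brentq; the coefficients and the gap helper are just the problem-specific numbers quoted above, not part of the original code:

from scipy.optimize import brentq

# n-step return values G_{0:1} ... G_{0:5} computed above (problem-specific numbers)
G = [7.864, -5.336, 25.864, -11.936, -0.336]
td1 = 6.164  # the TD(1) return computed above

def gap(lam):
    # TD(lambda) return minus TD(1) return, using the truncated lambda-return above
    td_lambda = (1 - lam) * sum(g * lam**n for n, g in enumerate(G))
    return td_lambda - td1

print(brentq(gap, 0.0, 0.99))  # approximately 0.205, matching the solution quoted above

As the EDIT below notes, this equation still misses the infinite-sum correction, so its root only approximates the expected $0.2055\ldots$.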

EDIT

As pointed out by bewestphal, this is not a full solution; it misses one crucial step needed to get it fully correct. A hint can be found in his answer, which is the correct solution.

Brale

The previous answer from Brale is mostly correct, but it is missing an important detail needed to get the precise answer.

Given that this is a question from a GT course homework, I only want to leave pointers so that those seeking help can understand the required concept.

The $TD(\lambda)$ return is a summation over infinitely many $k$-step returns ($G_{0:1} \to G_{0:\infty}$), and these should all be included in our equation $TD(\lambda) = TD(1)$.

Every $k$-step estimator that includes steps past the termination point equals the sum of the rewards.

Including these values in the summation reveals a pattern that makes the infinite summation solvable.
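As a sketch of the step being hinted at here, assuming (as in the answer above) that the episode terminates after five steps, so that every $k$-step return with $k \ge 5$ equals the total reward $R$, the infinite tail collapses through a geometric series:
\begin{align}
G_0^\lambda &= (1-\lambda)\sum_{k=1}^{4} \lambda^{k-1} G_{0:k} + (1-\lambda)\sum_{k=5}^{\infty} \lambda^{k-1} R\\
&= (1-\lambda)\sum_{k=1}^{4} \lambda^{k-1} G_{0:k} + \lambda^{4} R
\end{align}
since $(1-\lambda)\sum_{k=5}^{\infty} \lambda^{k-1} = \lambda^{4}$ for $0 \le \lambda < 1$.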

bewestphal