
In Temporal-Difference Learning, we update our value function by $V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\alpha\left(R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)\right)$

If we choose a constant $\alpha$, will the algorithm eventually give us the true state value function? Why or why not?


2 Answers

3

It should be noted that the selection of $\alpha$ is a classic problem in stochastic approximation rather than something specific to RL. Once you see this connection, the answer becomes clear.

What is stochastic approximation? As the name suggests, it is a family of methods that use data to approximate quantities that are (typically) expectations. For example, suppose \begin{align*} w=\mathbb{E}[X] \end{align*} where $X$ is a random variable. If we have iid samples $\{x_k\}$ of $X$, then we can estimate $w$ iteratively by \begin{align*} w_{k+1}=w_k-a_k(w_k-x_k). \end{align*}
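
As a concrete illustration, here is a minimal sketch of this iteration in Python (the distribution of $X$, its mean, and the choice $a_k=1/k$ are illustrative assumptions of mine, not part of the question). With $a_k=1/k$, the iterate $w_k$ is exactly the running sample mean, so the poor initial guess is washed out after the first step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate w = E[X] via the stochastic approximation w_{k+1} = w_k - a_k (w_k - x_k).
w = 10.0                                  # deliberately poor initial guess
for k in range(1, 10_001):
    x_k = rng.normal(loc=3.0, scale=1.0)  # iid sample of X; true mean is 3
    a_k = 1.0 / k                         # satisfies sum a_k = inf and sum a_k^2 < inf
    w = w - a_k * (w - x_k)

print(w)  # close to 3.0
```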

Here, note that $a_k$ is time-varying rather than constant. In fact, the classical condition that guarantees the almost-sure convergence of $w_k$ is \begin{align*} \sum_k a_k&=\infty\\ \sum_k a_k^2&<\infty \end{align*} Why is such a condition required? A rigorous proof can be found in analyses of the Robbins-Monro algorithm. Here, I merely give some intuition for why each part is needed.

  • First, $\sum_{k=1}^\infty a_k=\infty$ says that the step sizes must be sufficiently large in total, in order to overcome an arbitrary initial condition. The mathematical reasoning is as follows: suppose, hypothetically, that $\sum_{k=1}^\infty a_k<\infty$, and that $\delta_k\doteq w_k-x_k$ is bounded. Then $\sum_{k=1}^\infty a_k|\delta_k|<\infty$, so the total change $|w_\infty-w_1|$ is bounded. If the initial guess $w_1$ is farther from the solution than this bound, the iteration can never reach the true solution.
  • Second, the condition $\sum_k a_k^2<\infty$ requires that $a_k$ converge to zero. Mathematically, the difference between $w_{k+1}$ and $w_k$ is $-a_k\delta_k$. If $a_k$ does not go to zero, then $w_k$ keeps fluctuating even when $k$ is very large, because each noisy sample still moves the estimate by a non-vanishing amount.

Of course, if $a_k=\alpha$ is a constant, then $\sum_k a_k^2<\infty$ is not satisfied.
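
A quick numerical comparison (again a sketch under the same illustrative assumptions, with an arbitrarily chosen constant $\alpha=0.1$) shows the consequence: the $1/k$ iterate settles down, whereas the constant-step iterate keeps fluctuating around the true mean.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 3.0
alpha = 0.1                                    # arbitrary constant step size

w_decay, w_const = 0.0, 0.0
tail_decay, tail_const = [], []
for k in range(1, 100_001):
    x_k = rng.normal(loc=true_mean, scale=1.0)
    w_decay -= (1.0 / k) * (w_decay - x_k)     # Robbins-Monro step sizes
    w_const -= alpha * (w_const - x_k)         # constant step size
    if k > 90_000:                             # keep the last 10% of iterates
        tail_decay.append(w_decay)
        tail_const.append(w_const)

print("spread of late iterates, a_k = 1/k:", np.std(tail_decay))  # tiny
print("spread of late iterates, constant :", np.std(tail_const))  # noticeably larger
```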

Now, let's come back to the TD algorithm. It can in fact be viewed as a stochastic approximation algorithm. Recall that the Bellman equation is \begin{align} v_\pi(s)=\mathbb{E}[R+\gamma v_\pi(S')\mid S=s],\quad \forall s. \end{align} If we write $X\doteq R+\gamma v_\pi(S')$, this becomes $v_\pi(s)=\mathbb{E}[X\mid S=s]$, which has the same form as $w=\mathbb{E}[X]$ above. Hence, the stochastic approximation algorithm that solves this equation from data is \begin{align} v_{t+1}(s_t) &=v_t(s_t)-a_t(s_t)\big[v_t(s_t)-(r_t+\gamma v_t(s_{t+1}))\big],\qquad t=1,2,\dots \end{align}

Now, you may see how TD is obtained and why the step size should be $a_t(s_t)$ instead of a constant.
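
To make the connection concrete, here is a minimal tabular TD(0) sketch on the classic 5-state random walk (the environment, the random policy, and the per-state step size $a_t(s_t)=1/N(s_t)$, where $N$ counts visits, are illustrative choices of mine, not part of the question).

```python
import numpy as np

rng = np.random.default_rng(2)

# 5 non-terminal states 0..4, terminals at -1 and 5; reward +1 only when stepping
# into the right terminal; gamma = 1; the policy moves left/right uniformly.
n_states, gamma = 5, 1.0
V = np.zeros(n_states)   # estimates v_t(s)
N = np.zeros(n_states)   # visit counts, used for the step size a_t(s) = 1 / N(s)

for episode in range(5000):
    s = 2                # start in the middle
    while True:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        terminal = s_next in (-1, n_states)
        v_next = 0.0 if terminal else V[s_next]
        N[s] += 1
        a = 1.0 / N[s]                               # per-state, decaying step size
        V[s] -= a * (V[s] - (r + gamma * v_next))    # TD(0) in stochastic-approximation form
        if terminal:
            break
        s = s_next

print(V)  # approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6]
```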

0

In general, NO.

You don't get the "true" state value function. TD learning approximates the true value function. The approximation can be very close, or even exact in simple cases, but in general it is just an approximation.

Depending on the difficulty of the problem, a non-constant $\alpha$ can help the value estimate approach the true value function more quickly, or keep the learning from getting stuck in a local minimum.

There are implementations, like ADAM, that adaptively change the learning rate for each feature. You can usually expect much faster convergence if you adapt the learning rate in this way.
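
As a rough sketch of what a per-feature adaptive step size looks like in this setting, here is a hand-rolled Adam-style update on top of semi-gradient TD(0) with one-hot features, on a small 5-state random walk; the environment and all hyperparameters are arbitrary choices of mine, not something prescribed by this answer.

```python
import numpy as np

rng = np.random.default_rng(3)

# Semi-gradient TD(0) with one-hot features, where an Adam-style update gives each
# feature (here: each state) its own adaptive step size. Environment: 5-state random
# walk with terminals at -1 and 5, reward +1 at the right terminal, gamma = 1.
n_states, gamma = 5, 1.0
w = np.zeros(n_states)                          # V(s) = w[s] with one-hot features
m, v = np.zeros(n_states), np.zeros(n_states)   # Adam first/second moment estimates
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
t = 0

for episode in range(3000):
    s = 2
    while True:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        terminal = s_next in (-1, n_states)
        td_error = r + gamma * (0.0 if terminal else w[s_next]) - w[s]
        g = np.zeros(n_states)
        g[s] = -td_error        # semi-gradient of 0.5 * td_error^2 w.r.t. w (target held fixed)
        t += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # per-feature adaptive step
        if terminal:
            break
        s = s_next

print(w)  # moves toward [1/6, ..., 5/6] but keeps fluctuating, since the base step size is constant
```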

John Rothman