
Minimizing the projected Bellman error has been shown to be stable with linear function approximation, and the technique is not new at all. I can only wonder why it has not been adopted for non-linear function approximation (e.g., DQN). Instead, a less theoretically justified target network is used.

I could come up with two possible explanations:

  1. It doesn't readily apply to the non-linear function approximation case (some work is needed).
  2. It doesn't yield a good solution. This is the case for the true Bellman error, but I'm not sure about the projected one.
Phizaz

2 Answers


I have found some clues in Maei's thesis (2011), “Gradient Temporal-Difference Learning Algorithms.”

According to the thesis:

  1. GTD2 is a method that minimizes the projected Bellman error (MSPBE).
  2. GTD2 is convergent in the non-linear function approximation case (and off-policy).
  3. GTD2 converges to a TD fixed point (the same point as semi-gradient TD).
  4. GTD2 is slower to converge than the usual semi-gradient TD.
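For reference, in the linear case GTD2 maintains an auxiliary weight vector w alongside the main weights θ. A minimal NumPy sketch, using a made-up one-state toy MRP (the step sizes and the example itself are illustrative assumptions, not from the thesis):

```python
import numpy as np

def gtd2_step(theta, w, phi, r, phi_next, gamma, alpha, beta):
    """One GTD2 update with linear function approximation.

    theta: main value-function weights.
    w: auxiliary weights estimating the expected TD error given the features.
    """
    delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

# Toy example: a single state with a self-loop, reward 1, gamma = 0.5,
# so the true value (and TD fixed point) is 1 / (1 - 0.5) = 2.
phi = np.array([1.0])
theta, w = np.zeros(1), np.zeros(1)
for _ in range(5000):
    theta, w = gtd2_step(theta, w, phi, 1.0, phi, 0.5, alpha=0.05, beta=0.1)
print(theta)  # prints approximately [2.]
```

Note the design: θ is not moved along the raw TD error but along a correction term weighted by w, which is what makes the update a true stochastic gradient of the MSPBE and hence stable off-policy.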

“It doesn't readily apply to non-linear function approximation.”

No, it does apply.

“It doesn't yield a good solution.”

No, it does yield one. The TD fixed point is the same point that semi-gradient TD (which is generally used) converges to, so neither method has an edge there.

The only explanation seems to be practical convergence rate.

To quote Maei:

“Some of our empirical results suggest that gradient-TD methods may be slower than conventional TD methods on problems on which conventional TD methods are sound (that is, on-policy learning problems).”

Phizaz

As I understand it, the above-mentioned projection operator projects onto the linear subspace spanned by a set of feature vectors (or feature functions), i.e., the space of linear combinations of features. Vanilla DQN doesn't have such a feature space, so projection onto a linear subspace doesn't make sense in the DQN context. If you attempted to produce a feature space for the values/Q with some NN, it wouldn't be DQN (because Q wouldn't be produced directly), and it wouldn't work anyway on anything but toy problems, because the number of degrees of freedom of the output would be too high.
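Concretely, in the linear case the projection Π maps an arbitrary value vector onto the span of the feature matrix Φ, weighted by the state distribution d. A small sketch (the feature matrix and distribution below are made-up illustrations):

```python
import numpy as np

# Hypothetical setup: 4 states, 2 features per state.
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [1.0, 2.0]])
d = np.array([0.4, 0.3, 0.2, 0.1])   # state visitation distribution
D = np.diag(d)

# Weighted projection onto span(Phi): Pi = Phi (Phi^T D Phi)^-1 Phi^T D
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

v = Phi @ np.array([2.0, -1.0])      # a value vector already in span(Phi)
assert np.allclose(Pi @ v, v)        # projection leaves it unchanged
assert np.allclose(Pi @ Pi, Pi)      # Pi is idempotent, as a projection must be
```

A DQN has no fixed Φ, so this Π has no direct analogue; Maei's non-linear extension instead projects onto the tangent space of the network at the current parameters.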

mirror2image