
My understanding of tabular Q-learning is that it essentially builds a dictionary of state-action pairs, so as to maximize the Markovian (i.e., step-wise, history-agnostic?) reward. This incremental update of the Q-table can be done by trading off exploration and exploitation, but the fact remains that one "walks around" the table until it converges to optimality.
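For concreteness, something like the following rough sketch is what I mean by "walking around" the table (the action count, hyperparameters, and function names are just placeholders, not a specific implementation):

```python
import random
from collections import defaultdict

# Q-table as a dictionary keyed by (state, action) pairs, defaulting to 0.
Q = defaultdict(float)

def epsilon_greedy(state, n_actions, epsilon=0.1):
    """Trade off exploration (random action) and exploitation (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, n_actions, alpha=0.1, gamma=0.99):
    """One incremental, history-agnostic Q-learning update for a single step."""
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
```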

But what if we haven't "walked around" the whole table? Can the algorithm still perform well in those out-of-sample state-action pairs?

Tfovid

1 Answer


In the tabular case, the Q-table will only converge if you have walked around the whole of the table. Note that to guarantee convergence we need $\sum\limits_{n=1}^{\infty}\alpha_n(a) = \infty$ and $\sum\limits_{n=1}^\infty \alpha_n^2(a) < \infty$. These conditions imply that, in the limit, each state-action pair will have been visited an infinite number of times, so we will have walked around the whole table and there are no out-of-sample state-action pairs.
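For example, a per-pair step size of $\alpha_n = 1/n$, where $n$ counts the visits to a given state-action pair, satisfies both conditions, since $\sum_n 1/n$ diverges while $\sum_n 1/n^2$ converges. A rough sketch of such a schedule (the names here are purely illustrative):

```python
from collections import defaultdict

visit_count = defaultdict(int)

def learning_rate(state, action):
    """Return alpha_n = 1/n for the n-th visit to (state, action).
    sum_n 1/n diverges and sum_n 1/n^2 converges, so both conditions hold."""
    visit_count[(state, action)] += 1
    return 1.0 / visit_count[(state, action)]
```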

However, in the case of function approximation, convergence is no longer guaranteed. Generalisation is possible though: if we have an infinite (e.g. continuous) state or action space, we will typically visit any given state-action pair at most once, so the role of a function approximator is to allow us to generalise across the state/action space.
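To make the generalisation point concrete, here is a rough sketch of my own (the linear approximator and the feature map are illustrative assumptions, not a recommended architecture): a parameterised $Q_\theta(s, a)$ replaces the table, so an update from one visited pair also changes the estimates of nearby, unvisited pairs.

```python
import numpy as np

def features(state, action):
    """Hypothetical feature map phi(s, a); in practice this is problem-specific."""
    return np.array([state, action, state * action, 1.0])

theta = np.zeros(4)  # weights of the linear approximator Q_theta(s, a) = theta . phi(s, a)

def q_value(state, action):
    return float(theta @ features(state, action))

def semi_gradient_update(state, action, reward, next_state, actions,
                         alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning step: because all state-action pairs share the
    same weights theta, estimates of unvisited pairs change as well -- that is
    the generalisation a function approximator provides."""
    global theta
    best_next = max(q_value(next_state, a) for a in actions)
    td_error = reward + gamma * best_next - q_value(state, action)
    theta = theta + alpha * td_error * features(state, action)
```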

NB that the convergence conditions I mentioned are only required in some proofs of convergence, depending on what type of convergence you are looking to prove. See this answer for more details.

David