3

I'm new to reinforcement learning.

As is common in RL, $\epsilon$-greedy search is used for the behavior/exploration policy. At the beginning of training, $\epsilon$ is high, and therefore a lot of random actions are chosen. Over time, $\epsilon$ decreases and we choose the (currently estimated) best action more and more often.
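For concreteness, here is a minimal Python sketch of what I mean by $\epsilon$-greedy selection with a decaying $\epsilon$ (the names and the decay schedule are just placeholders, not anything specific):

```python
import random

# Minimal sketch of epsilon-greedy action selection with a decaying epsilon.
# All names here (select_action, q_values, decay, min_epsilon) are placeholders.

def select_action(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

epsilon = 1.0        # start almost fully random
decay = 0.995        # multiplicative decay per episode
min_epsilon = 0.01   # keep a little exploration

for episode in range(1000):
    # ... run the episode, calling select_action(...) at each step ...
    epsilon = max(min_epsilon, epsilon * decay)
```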

  1. I was wondering: in Q-learning, for example, once $\epsilon$ is small (e.g. 0.1 or 0.01), do the Q-values really still change? Do they only drift in the same direction, i.e. the best action stays the best action and the gap between the values just widens, or can the values still change enough that the best action for a given state changes?

  2. If the Q-values really do still change strongly, is that because of the remaining random actions that we still take at $\epsilon>0$, or would they still change at $\epsilon=0$?

nbro

1 Answer

2
  1. How much the $Q$-values change does not depend on the value of $\epsilon$. Rather, $\epsilon$ dictates how likely you are to take a random action, and thus how likely you are to take an action that gives rise to a large TD error, that is, a large difference between the return you expected from taking this action and what you actually observed. How much a $Q$-value changes depends on the magnitude of this TD error (see the update rule written out after this list).

  2. $Q$-learning is not guaranteed to converge if there is no exploration. One of the convergence conditions is that each state-action pair is visited infinitely often over an infinite number of episodes; with no exploration a purely greedy policy can simply never select some actions, so this condition is not met.
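For reference, the standard tabular $Q$-learning update makes point 1 concrete ($\alpha$ is the learning rate, $\gamma$ the discount factor):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]}_{\text{TD error}}$$

$\epsilon$ only influences which action $a_t$ gets selected; it never appears in the update itself, so the size of each change is governed entirely by $\alpha$ and the TD error.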

David