3

I'm new to reinforcement learning.

As is common in RL, $\epsilon$-greedy search is used for the behavior/exploration policy. At the beginning of training, $\epsilon$ is high, and therefore a lot of random actions are chosen. Over time, $\epsilon$ decreases and we choose the (currently estimated) best action more and more often.
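For concreteness, here is a minimal Python sketch of what I mean by $\epsilon$-greedy selection with a decaying $\epsilon$ (the names and the decay schedule are just placeholders, not anything specific):

```python
import random

# Minimal sketch of epsilon-greedy action selection with a decaying epsilon.
# All names here (select_action, q_values, decay, min_epsilon) are placeholders.

def select_action(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

epsilon = 1.0        # start almost fully random
decay = 0.995        # multiplicative decay per episode
min_epsilon = 0.01   # keep a little exploration

for episode in range(1000):
    # ... run the episode, calling select_action(...) at each step ...
    epsilon = max(min_epsilon, epsilon * decay)
```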

  1. I was wondering: in Q-learning, for example, once $\epsilon$ is small (e.g. 0.1 or 0.01), do the Q-values really still change? Do they only drift in the same direction, i.e. the best action stays the best action and the gap between the values just widens, or can the values still change enough that the best action for a given state changes?

  2. If the Q-values really do still change strongly, is that because of the remaining random actions that we still take at $\epsilon>0$, or would they still change at $\epsilon=0$?

nbro

1 Answer

2
  1. How much the $Q$-values change does not depend on the value of $\epsilon$. Rather, $\epsilon$ dictates how likely you are to take a random action, and thus how likely you are to take an action that gives rise to a large TD error, that is, a large difference between the return you expected from taking this action and what you actually observed. How much a $Q$-value changes depends on the magnitude of this TD error (see the update rule written out after this list).

  2. $Q$-learning is not guaranteed to converge if there is no exploration. One of the convergence conditions is that each state-action pair is visited infinitely often over an infinite number of episodes; with no exploration a purely greedy policy can simply never select some actions, so this condition is not met.
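For reference, the standard tabular $Q$-learning update makes point 1 concrete ($\alpha$ is the learning rate, $\gamma$ the discount factor):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]}_{\text{TD error}}$$

$\epsilon$ only influences which action $a_t$ gets selected; it never appears in the update itself, so the size of each change is governed entirely by $\alpha$ and the TD error.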

David