
I am working on an assignment where I have an agent learn state-action values (Q-values) with Q-learning and a 100% exploration rate. The environment is the classic gridworld shown in the following picture.

[Figure: gridworld environment; each state shows arrows whose lengths reflect the learned Q-values, with a trophy as the goal state]

Here are the values of my parameters.

  • Learning rate = 0.1
  • Discount factor = 0.95
  • Default reward = 0

Reaching the trophy gives the only, terminal reward; no negative reward is given for bumping into walls or for taking a step.

After 500 episodes, the arrows have converged. As shown in the figure, some states have longer arrows than others (i.e., larger Q-values). Why is this so? I don't understand how the agent learns and finds the optimal actions when the exploration rate is 100% (each action, N/S/E/W, has a 25% chance of being selected).

Rim Sleimi
1 Answer


Q-learning is guaranteed to converge (in the tabular case) under some mild conditions, one of which is that in the limit we visit each state-action pair infinitely many times. If your random policy (i.e. 100% exploration) guarantees this and the other conditions are met (which they probably are), then Q-learning will converge. Note that Q-learning is off-policy: its update bootstraps from $\max_a Q(s', a)$, so the values it converges to are those of the greedy (optimal) policy even though the behavior policy is purely random -- this is how the agent can find the optimal actions despite never exploiting during learning.
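To make this concrete, here is a minimal tabular Q-learning sketch under a uniformly random behavior policy, assuming a small deterministic grid with a single +1 reward at the trophy. The grid size, trophy cell and start cell below are placeholders rather than values read off the picture.

```python
import random
from collections import defaultdict

ALPHA = 0.1        # learning rate from the question
GAMMA = 0.95       # discount factor from the question
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # N, S, E, W as (row, col) offsets
SIZE = 4           # assumed grid size; the real layout is only in the picture
TROPHY = (0, 3)    # assumed trophy cell
START = (3, 0)     # assumed start cell

def step(state, action):
    """Deterministic move; bumping into a wall leaves the agent where it is."""
    row = min(max(state[0] + action[0], 0), SIZE - 1)
    col = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (row, col)
    if next_state == TROPHY:
        return next_state, 1.0, True     # the only reward: +1 for reaching the trophy
    return next_state, 0.0, False        # default reward of 0 everywhere else

Q = defaultdict(float)                   # tabular Q-values, keyed by (state, action)

for episode in range(500):
    state, done = START, False
    while not done:
        # 100% exploration: each of N/S/E/W has a 25% chance, regardless of Q
        action = random.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        # Off-policy target: bootstrap from the greedy value of the next state,
        # not from whatever action the random behavior policy will take there
        target = reward if done else reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state

# The greedy action in each cell corresponds to the arrows in the figure
for row in range(SIZE):
    print([max(ACTIONS, key=lambda a: Q[((row, col), a)]) for col in range(SIZE)])
```

After enough episodes the greedy action in each cell (the arrows in the figure) points along a shortest path to the trophy, even though every action taken during learning was chosen uniformly at random.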

The reason that different state-action pairs have longer arrows, i.e. higher Q-values, is simply that the value of taking that action in that state is higher. An example would be the downward arrow in the cell directly above the trophy -- obviously this has the highest Q-value, as the return from there is 1. For all other state-action pairs it will be $\gamma^k$ for some $k \geq 1$ -- to see this, remember that a Q-value is defined as

$$Q(s, a) = \mathbb{E}_\pi \left[\sum_{j=0}^\infty \gamma^j R_{t+j+1} \,\middle|\, S_t = s, A_t = a \right]\;;$$ so for any state-action pair other than the cell above the trophy with the down arrow, $\sum_{j=0}^\infty \gamma^j R_{t+j+1}$ will be a sum of $0$'s plus a single term $\gamma^{n-1}$, where $n$ is the number of steps until the trophy is finally reached when acting greedily (assuming you give a reward of 1 for reaching the trophy).
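To make the arrow lengths concrete: with $\gamma = 0.95$, deterministic moves and a single reward of $1$ at the trophy (the same assumptions as above), the converged Q-values are just powers of the discount factor,

$$Q^*(s, a) = \gamma^{\,n(s,a) - 1}, \qquad \text{e.g. } \gamma^0 = 1,\; \gamma^1 = 0.95,\; \gamma^2 = 0.9025,\; \gamma^3 \approx 0.857,\; \dots$$

where $n(s, a)$ is the number of steps needed to reach the trophy when taking $a$ in $s$ and acting greedily afterwards. State-action pairs closer to the trophy therefore have larger Q-values, which is why their arrows are drawn longer.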

David