
I am currently trying to learn reinforcement learning and I started with the basic gridworld application. I tried Q-learning with the following parameters:

  • Learning rate = 0.1
  • Discount factor = 0.95
  • Exploration rate = 0.1
  • Default reward = 0
  • The final reward (for reaching the trophy) = 1

After 500 episodes I got the following results:

[figure: gridworld results after 500 episodes]

How would I compute the optimal state-action value for, say, state 2 (where the agent is standing) and the action "south"?

My intuition was to use the following update rule of the $q$ function:

$$Q[s, a] = Q[s, a] + \alpha (r + \gamma \max_{a'} Q[s', a'] - Q[s, a])$$

But I am not sure about this. The math doesn't add up for me when I apply the update rule.
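To make this concrete, here is a minimal sketch of how I understand the update would look in code (the grid size, state indexing, and terminal handling are placeholders I made up for illustration, not my actual environment):

```python
import numpy as np

N_STATES = 12    # placeholder: e.g. a 3x4 grid flattened to indices 0..11
N_ACTIONS = 4    # north, south, east, west

alpha = 0.1      # learning rate
gamma = 0.95     # discount factor
epsilon = 0.1    # exploration rate (used for epsilon-greedy action selection)

Q = np.zeros((N_STATES, N_ACTIONS))

def update(s, a, r, s_next, done):
    """One Q-learning update, matching the rule above."""
    # No bootstrap term when s_next is terminal (the trophy state).
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```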

I am also wondering whether I should use the backup diagram to find the optimal state-action value by propagating the reward (gained from reaching the trophy) back to the state in question.

For reference, this is where I learned about the backup diagram.


1 Answer


It seems that you are getting confused between the definition of a Q-value and the update rule used to obtain these Q-values.

Remember that, to obtain the optimal Q-value for a given state-action pair, we can simply evaluate

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')\,,$$

where $s'$ is the state we transitioned into (note that this simple form only holds for the optimal Q-value with deterministic transitions; with stochastic dynamics, or when evaluating a stochastic policy rather than the optimal one, we would have to introduce expectations).

Now, this assumes that we have already been given (or have obtained) the optimal Q-values. To obtain them, we use the update rule you mentioned in your question (or any other learning algorithm).
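As a concrete illustration, here is a minimal sketch of evaluating that expression for your example (state 2, action "south"). The Q-table numbers, the assumption that "south" moves deterministically to the square next to the trophy, and the reward of 0 for that step are all made up for illustration; plug in the values from your own table:

```python
import numpy as np

gamma = 0.95

# Hypothetical learned Q-table (rows = states, columns = actions);
# the numbers are made up purely to illustrate the evaluation.
Q = np.array([
    [0.77, 0.81, 0.81, 0.90],   # state 0
    [0.81, 0.86, 0.90, 1.00],   # state 1: one step from the trophy
    [0.86, 0.95, 0.86, 0.90],   # state 2: where the agent stands
])

def optimal_q(r, s_next, terminal=False):
    """Q(s, a) = r + gamma * max_a' Q(s', a') for a deterministic transition."""
    return r if terminal else r + gamma * np.max(Q[s_next])

# Suppose taking "south" in state 2 deterministically leads to state 1
# with reward 0; then the optimal value of (state 2, south) is:
print(optimal_q(r=0.0, s_next=1))   # 0 + 0.95 * max(Q[1]) = 0.95
```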
