
I am currently trying to learn reinforcement learning and I started with the basic gridworld application. I tried Q-learning with the following parameters:

  • Learning rate = 0.1
  • Discount factor = 0.95
  • Exploration rate = 0.1
  • Default reward = 0
  • The final reward (for reaching the trophy) = 1

After 500 episodes I got the following results:

[figure: gridworld results after 500 episodes]

How would I compute the optimal state-action value for, say, state 2 (where the agent is standing) and the action "south"?

My intuition was to use the following update rule of the $q$ function:

$$Q[s, a] = Q[s, a] + \alpha (r + \gamma \max_{a'} Q[s', a'] - Q[s, a])$$

But I am not sure about this. The math doesn't add up for me when I apply the update rule.
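To make this concrete, here is a minimal sketch of how I understand the update would look in code (the grid size, state indexing, and terminal handling are placeholders I made up for illustration, not my actual environment):

```python
import numpy as np

N_STATES = 12    # placeholder: e.g. a 3x4 grid flattened to indices 0..11
N_ACTIONS = 4    # north, south, east, west

alpha = 0.1      # learning rate
gamma = 0.95     # discount factor
epsilon = 0.1    # exploration rate (used for epsilon-greedy action selection)

Q = np.zeros((N_STATES, N_ACTIONS))

def update(s, a, r, s_next, done):
    """One Q-learning update, matching the rule above."""
    # No bootstrap term when s_next is terminal (the trophy state).
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```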

I am also wondering whether I should use the backup diagram to find the optimal state-action value by propagating the reward (gained from reaching the trophy) back to the state in question.

For reference, this is where I learned about the backup diagram.


1 Answer


It seems that you are getting confused between the definition of a Q-value and the update rule used to obtain these Q-values.

Remember that, to obtain the optimal Q-value for a given state-action pair, we can simply evaluate

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')\,,$$

where $s'$ is the state we transitioned into (note that this simple form only holds for the optimal Q-value with deterministic transitions; with stochastic dynamics, or when evaluating a stochastic policy rather than the optimal one, we would have to introduce expectations).

Now, this assumes that we have already been given (or have obtained) the optimal Q-values. To obtain them, we use the update rule you mentioned in your question (or any other learning algorithm).
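As a concrete illustration, here is a minimal sketch of evaluating that expression for your example (state 2, action "south"). The Q-table numbers, the assumption that "south" moves deterministically to the square next to the trophy, and the reward of 0 for that step are all made up for illustration; plug in the values from your own table:

```python
import numpy as np

gamma = 0.95

# Hypothetical learned Q-table (rows = states, columns = actions);
# the numbers are made up purely to illustrate the evaluation.
Q = np.array([
    [0.77, 0.81, 0.81, 0.90],   # state 0
    [0.81, 0.86, 0.90, 1.00],   # state 1: one step from the trophy
    [0.86, 0.95, 0.86, 0.90],   # state 2: where the agent stands
])

def optimal_q(r, s_next, terminal=False):
    """Q(s, a) = r + gamma * max_a' Q(s', a') for a deterministic transition."""
    return r if terminal else r + gamma * np.max(Q[s_next])

# Suppose taking "south" in state 2 deterministically leads to state 1
# with reward 0; then the optimal value of (state 2, south) is:
print(optimal_q(r=0.0, s_next=1))   # 0 + 0.95 * max(Q[1]) = 0.95
```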
