
For policy evaluation purposes, can we use the Q-learning algorithm even though, technically, it is meant for control?

Maybe like this:

  1. Use the policy to be evaluated as the behaviour policy.
  2. Update the Q value conventionally (i.e. update $Q(s,a)$ using the action $a'$ that gives the highest $Q(s',a')$ value).
  3. The final $Q(s,a)$ values will reflect the values for the policy being evaluated.

Am I missing something here, given that I have not seen Q-learning being used anywhere for evaluation purposes?

nbro

1 Answer


For off-policy learning you must have two policies: a behaviour policy and a target policy. If the two policies are the same, then you end up with SARSA, not Q-learning.

You cannot use Q-learning directly to evaluate a fixed target policy, because its target policy is always greedy with respect to the current value estimates, so it learns the optimal value function regardless of the behaviour policy. Instead you must use another variant of off-policy learning that can evaluate an arbitrary target policy.

Your suggested algorithm is:

  1. Use the policy to be evaluated as the behaviour policy.
  2. Update the Q value conventionally (i.e. update $Q(s,a)$ using the action $a'$ that gives the highest $Q(s',a')$ value).
  3. The final $Q(s,a)$ values will reflect the values for the policy being evaluated.

This will not work for evaluating the behaviour policy. If the behaviour policy is stochastic and covers all possible state/action choices, then this is still Q-learning and will converge on the optimal value function, perhaps very slowly if the behaviour policy rarely reaches important states.

The "trick" to off-policy is that the environment interaction part uses the behaviour policy to collect data, and the update step uses the target policy to calculate estimated returns. In general for off-policy updates, there can be corrections required to re-weight the estimated returns. However, one nice thing about single-step TD methods is that there are no such additional corrections needed.

This gives a way to do off-policy TD learning, using an approach called Expected SARSA. To use Expected SARSA, you will need to know the target policy's distribution over action choices, i.e. $\pi(a|s)$.

This is the variant of your description that will work to evaluate your target policy $\pi(a|s)$:

  1. Use any stochastic policy that "covers" the target policy as the behaviour policy.
  2. Update the Q value using the Expected SARSA target: $Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \sum_{a'} \pi(a'|s') Q(s',a') - Q(s,a) \right)$
  3. The final $Q(s,a)$ values will reflect the values for the policy being evaluated.
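Here is a minimal sketch of that procedure in Python. The tiny random-walk MDP, the 30/70 target policy and the uniform behaviour policy are made up purely for illustration; only the Expected SARSA update itself is taken from the steps above.

```python
import numpy as np

# Toy 5-state random walk (illustrative only): actions 0 = left, 1 = right.
# Stepping off the right edge gives reward +1 and ends the episode;
# stepping off the left edge gives reward 0 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s_next = s - 1 if a == 0 else s + 1
    if s_next < 0:
        return 0, 0.0, True                 # terminated on the left, reward 0
    if s_next >= N_STATES:
        return N_STATES - 1, 1.0, True      # terminated on the right, reward +1
    return s_next, 0.0, False

# Target policy pi(a|s) to be evaluated: 30% left, 70% right in every state.
pi = np.full((N_STATES, N_ACTIONS), [0.3, 0.7])

# Behaviour policy b(a|s): uniform random, which "covers" the target policy.
b = np.full((N_STATES, N_ACTIONS), 0.5)

def expected_sarsa_evaluation(episodes=5000, alpha=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s, done = 2, False                  # start in the middle state
        while not done:
            a = rng.choice(N_ACTIONS, p=b[s])   # act with the behaviour policy
            s_next, r, done = step(s, a)
            # Expected SARSA target: expectation over the *target* policy pi,
            # so no extra re-weighting is needed for this one-step update.
            target = r if done else r + gamma * np.dot(pi[s_next], Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

print(expected_sarsa_evaluation())   # estimates of Q^pi(s,a) for the 30/70 policy
```

Because the expectation in the target is taken over $\pi$, the estimates converge towards $Q^\pi$ even though every action is selected by the uniform behaviour policy.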

It is worth noting that Expected SARSA with the greedy target policy $\pi(s) = \text{argmax}_a Q(s,a)$ is exactly Q-learning. Expected SARSA is a strict generalisation of Q-learning that allows learning the value function of any target policy. You may not see it used as much as Q-learning, because learning an optimal value function is the more common goal in practice.
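As a quick check of that equivalence, using made-up $Q(s',\cdot)$ values, putting all of the target policy's probability mass on the greedy action turns the Expected SARSA target into the Q-learning target:

```python
import numpy as np

Q_next = np.array([0.2, 0.7, 0.1])        # hypothetical Q(s', .) values

# Greedy target policy: all probability mass on argmax_a Q(s', a).
pi_greedy = np.zeros_like(Q_next)
pi_greedy[np.argmax(Q_next)] = 1.0

expected_sarsa_target = np.dot(pi_greedy, Q_next)   # sum_a' pi(a'|s') Q(s',a')
q_learning_target = Q_next.max()                    # max_a' Q(s',a')

assert np.isclose(expected_sarsa_target, q_learning_target)
```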

Neil Slater