I read this post and have been thinking about it, and now I have a hypothesis, but I am not sure whether it is correct. I claim that in Q-learning $Q(s,a)$ can be replaced by $V(s)$ when
- the transition function is deterministic, i.e. each state-action pair $(s,a)$ leads to a unique successor state $s' = f(s,a)$, so $p(s'\mid a,s) = 1$ for that successor
- the policy is deterministic, so instead of a distribution $\pi(a \mid s)$ it can be written as a function $a = \pi(s)$
- the reward function $R(s, a, s')$ is independent of the action and thus just $R(s, s')$
- it is possible to enumerate all next states $s'$ for a given state $s$.
I first implemented this as a very basic deep Q-learning (or rather deep V-learning) network and used OpenAI Gym's FrozenLake environment. FrozenLake (in its non-slippery variant) has deterministic transition probabilities, the states are known, and the reward is bound to the state rather than to the action, so it fulfills my requirements. I implemented only one neural network, which is used for both policy and target prediction. The network outputs a single value $V(s)$ rather than one value per state-action pair. I used a replay buffer and epsilon-greedy exploration: with probability epsilon a random action is taken, otherwise the action leading to the successor state that maximizes $V^\pi(s')$, i.e. $\arg\max_{s'} V^\pi(s')$, is chosen. This learned an optimal policy, so I then tried to prove my claim formally.
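For concreteness, here is a minimal sketch of just that action-selection step (a simplified stand-in for what the network version does, not the actual code). It assumes Gym's FrozenLake-v1 with `is_slippery=False` so the transitions really are deterministic, uses `env.unwrapped.P` to enumerate the successor states, and `value_fn` stands in for the network; the one-step reward is included in the comparison because FrozenLake hands out its reward on the transition into the goal:

```python
import random
import gym

GAMMA = 0.95  # assumed discount factor

# Non-slippery variant, so the transitions are actually deterministic.
env = gym.make("FrozenLake-v1", is_slippery=False)

def successors(state):
    """Enumerate (action, next_state, reward) using the environment's
    transition table; deterministic, so exactly one entry per action."""
    return [(a, trans[0][1], trans[0][2])
            for a, trans in env.unwrapped.P[state].items()]

def epsilon_greedy_action(state, value_fn, epsilon):
    """Random action with probability epsilon, otherwise the action whose
    successor maximizes the one-step lookahead R + gamma * V(s')."""
    if random.random() < epsilon:
        return env.action_space.sample()
    return max(successors(state),
               key=lambda t: t[2] + GAMMA * value_fn(t[1]))[0]

# value_fn would be the neural network in the deep version; a table works too.
V = [0.0] * env.observation_space.n
a = epsilon_greedy_action(0, lambda s: V[s], epsilon=0.1)
```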
Proof:
By definition:
$Q^\pi(s,a) = \sum_{s'\in S}p(s'|a,s)[R(s, a, s') + \gamma V^\pi(s')]$
Since the transitions are deterministic, for each $(s,a)$ there is exactly one successor state $s' = f(s,a)$ with $p(s'|a,s) = 1$, so the sum collapses to a single term; using $R(s,a,s') = R(s,s')$ this gives:
$Q^\pi(s,a) = R(s,s') + \gamma V^\pi(s'), \quad s' = f(s,a)$
The action $a$ now enters only through the successor state $s'$. For the action picked by a deterministic policy I write $Q^\pi(s) := Q^\pi(s, \pi(s))$ as a shorthand.
Next, by definition:
$V^\pi(s) = \sum_{a \in A}\pi(a|s)\, Q^\pi(s,a)$
Since the policy is deterministic, $\pi(a|s) = 1$ only for $a = \pi(s)$ and $0$ otherwise, so the sum reduces to a single term:
$V^\pi(s) = Q^\pi(s, \pi(s)) = Q^\pi(s)$
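Putting the two steps together: under a deterministic model and a deterministic policy, the Bellman equation collapses to a recursion in $V$ alone, where $s'$ is the state that $\pi$ leads to from $s$:
$V^\pi(s) = Q^\pi(s, \pi(s)) = R(s, s') + \gamma V^\pi(s'), \quad s' = f(s, \pi(s))$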
By definition, the Q-learning update rule is (I drop the superscript $\pi$ from here on, since Q-learning bootstraps from the greedy action rather than from the policy):
$Q(s,a) \leftarrow Q(s,a) + \alpha[R(s, a, s') + \gamma \max_{a'} Q(s',a')- Q(s,a)]$
We showed above that $Q(s,a)$ depends on the action only through the successor state: $Q(s,a) = R(s,s') + \gamma V(s')$ with $s' = f(s,a)$. Since all successors of a state can be enumerated, the max over actions can be carried out as a max over successor states, $\max_{a'} Q(s',a') = \max_{s''}[R(s',s'') + \gamma V(s'')]$, where $s''$ ranges over the successors of $s'$. Substituting $Q(s,a) = R(s,s') + \gamma V(s')$ on both sides of the update rule and using $R(s,a,s') = R(s,s')$ (the $R(s,s')$ terms cancel and a common factor $\gamma$ drops out) leaves an update that involves only $V$:
$V(s') \leftarrow V(s') + \alpha[\max_{s''}(R(s',s'') + \gamma V(s'')) - V(s')]$
Relabelling $s' \to s$, this is simply
$V(s) \leftarrow V(s) + \alpha[\max_{s'}(R(s,s') + \gamma V(s')) - V(s)]$
so both the greedy action selection and the update can be computed from $V$ alone.
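As a sanity check that this update alone can recover an optimal policy, here is a minimal tabular sketch (not the deep version described above), again assuming the non-slippery FrozenLake-v1 and using the transition table `env.unwrapped.P` both to enumerate successors and to step; the hyperparameters and episode counts are arbitrary choices, not tuned values:

```python
import random
import gym

# Tabular sketch of the V-only update on non-slippery FrozenLake
# (deterministic transitions, enumerable successors).
env = gym.make("FrozenLake-v1", is_slippery=False)
P = env.unwrapped.P                      # P[s][a] = [(prob, s', reward, done)]
n_states, n_actions = env.observation_space.n, env.action_space.n

GAMMA, ALPHA, EPSILON = 0.95, 0.5, 0.2
V = [0.0] * n_states

def lookahead(s):
    """(action, s', R(s,s') + gamma*V(s')) for every action at s, using the
    transition table to enumerate the deterministic successors."""
    out = []
    for a, transitions in P[s].items():
        _, s2, r, done = transitions[0]  # deterministic: exactly one entry
        out.append((a, s2, r + (0.0 if done else GAMMA * V[s2])))
    return out

for episode in range(2000):
    s = 0                                # FrozenLake always starts in state 0
    for _ in range(200):                 # step cap to keep episodes finite
        # epsilon-greedy behaviour, greedy w.r.t. the one-step lookahead
        if random.random() < EPSILON:
            a = random.randrange(n_actions)
        else:
            a = max(lookahead(s), key=lambda t: t[2])[0]
        _, s2, r, done = P[s][a][0]      # step via the deterministic model

        # V-only Q-learning update:
        #   V(s) <- V(s) + alpha * [max_{s'}(R(s,s') + gamma V(s')) - V(s)]
        # Note the target uses the max, not the action actually taken.
        target = max(t[2] for t in lookahead(s))
        V[s] += ALPHA * (target - V[s])
        s = s2
        if done:
            break

greedy = [max(lookahead(s), key=lambda t: t[2])[0] for s in range(n_states)]
print("V:", [round(v, 2) for v in V])
print("greedy actions (0=Left, 1=Down, 2=Right, 3=Up):", greedy)
```

If everything works as intended, the printed greedy actions should steer around the holes and toward the goal, using only the single table $V$.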
What do you think?
UPDATE
Here is the code showing that an optimal policy in Q-learning can be found with just $V(s)$.