I read this post and have been thinking about it, and now I have a hypothesis, but I am not sure whether it is correct. I claim that in Q-learning $Q(s,a)$ can be replaced by $V(s)$ when
- the transition function is deterministic, i.e. each state-action pair $(s,a)$ leads to a unique successor state $s' = f(s,a)$, so $p(s'\mid a,s) = 1$ for that successor
- the policy is deterministic, so instead of a distribution $\pi(a \mid s)$ it can be written as a function $a = \pi(s)$
- the reward function $R(s, a, s')$ is independent of the action and thus just $R(s, s')$
- it is possible to enumerate all next states $s'$ for a given state $s$.
I first implemented this as a very basic deep Q-learning (or rather deep V-learning) network and used OpenAI Gym's FrozenLake environment. FrozenLake (in its non-slippery variant) has deterministic transition probabilities, the states are known, and the reward is bound to the state rather than to the action, so it fulfills my requirements. I implemented only one neural network, which is used for both policy and target prediction. The network outputs a single value $V(s)$ rather than one value per state-action pair. I used a replay buffer and epsilon-greedy exploration: with probability epsilon a random action is taken, otherwise the action leading to the successor state that maximizes $V^\pi(s')$, i.e. $\arg\max_{s'} V^\pi(s')$, is chosen. This learned an optimal policy, so I then tried to prove my claim formally.
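For concreteness, here is a minimal sketch of just that action-selection step (a simplified stand-in for what the network version does, not the actual code). It assumes Gym's FrozenLake-v1 with `is_slippery=False` so the transitions really are deterministic, uses `env.unwrapped.P` to enumerate the successor states, and `value_fn` stands in for the network; the one-step reward is included in the comparison because FrozenLake hands out its reward on the transition into the goal:

```python
import random
import gym

GAMMA = 0.95  # assumed discount factor

# Non-slippery variant, so the transitions are actually deterministic.
env = gym.make("FrozenLake-v1", is_slippery=False)

def successors(state):
    """Enumerate (action, next_state, reward) using the environment's
    transition table; deterministic, so exactly one entry per action."""
    return [(a, trans[0][1], trans[0][2])
            for a, trans in env.unwrapped.P[state].items()]

def epsilon_greedy_action(state, value_fn, epsilon):
    """Random action with probability epsilon, otherwise the action whose
    successor maximizes the one-step lookahead R + gamma * V(s')."""
    if random.random() < epsilon:
        return env.action_space.sample()
    return max(successors(state),
               key=lambda t: t[2] + GAMMA * value_fn(t[1]))[0]

# value_fn would be the neural network in the deep version; a table works too.
V = [0.0] * env.observation_space.n
a = epsilon_greedy_action(0, lambda s: V[s], epsilon=0.1)
```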
Proof:
By definition:
$Q^\pi(s,a) = \sum_{s'\in S}p(s'|a,s)[R(s, a, s') + \gamma V^\pi(s')]$
Since the transitions are deterministic, for each $(s,a)$ there is exactly one successor state $s' = f(s,a)$ with $p(s'|a,s) = 1$, so the sum collapses to a single term; using $R(s,a,s') = R(s,s')$ this gives:
$Q^\pi(s,a) = R(s,s') + \gamma V^\pi(s'), \quad s' = f(s,a)$
The action $a$ now enters only through the successor state $s'$. For the action picked by a deterministic policy I write $Q^\pi(s) := Q^\pi(s, \pi(s))$ as a shorthand.
Next, by definition:
$V^\pi(s) = \sum_{a \in A}\pi(a|s)\, Q^\pi(s,a)$
Since the policy is deterministic, $\pi(a|s) = 1$ only for $a = \pi(s)$ and $0$ otherwise, so the sum reduces to a single term:
$V^\pi(s) = Q^\pi(s, \pi(s)) = Q^\pi(s)$
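Putting the two steps together: under a deterministic model and a deterministic policy, the Bellman equation collapses to a recursion in $V$ alone, where $s'$ is the state that $\pi$ leads to from $s$:
$V^\pi(s) = Q^\pi(s, \pi(s)) = R(s, s') + \gamma V^\pi(s'), \quad s' = f(s, \pi(s))$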
By definition, the Q-learning update rule is (I drop the superscript $\pi$ from here on, since Q-learning bootstraps from the greedy action rather than from the policy):
$Q(s,a) \leftarrow Q(s,a) + \alpha[R(s, a, s') + \gamma \max_{a'} Q(s',a')- Q(s,a)]$
We showed above that $Q(s,a)$ depends on the action only through the successor state: $Q(s,a) = R(s,s') + \gamma V(s')$ with $s' = f(s,a)$. Since all successors of a state can be enumerated, the max over actions can be carried out as a max over successor states, $\max_{a'} Q(s',a') = \max_{s''}[R(s',s'') + \gamma V(s'')]$, where $s''$ ranges over the successors of $s'$. Substituting $Q(s,a) = R(s,s') + \gamma V(s')$ on both sides of the update rule and using $R(s,a,s') = R(s,s')$ (the $R(s,s')$ terms cancel and a common factor $\gamma$ drops out) leaves an update that involves only $V$:
$V(s') \leftarrow V(s') + \alpha[\max_{s''}(R(s',s'') + \gamma V(s'')) - V(s')]$
Relabelling $s' \to s$, this is simply
$V(s) \leftarrow V(s) + \alpha[\max_{s'}(R(s,s') + \gamma V(s')) - V(s)]$
so both the greedy action selection and the update can be computed from $V$ alone.
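As a sanity check that this update alone can recover an optimal policy, here is a minimal tabular sketch (not the deep version described above), again assuming the non-slippery FrozenLake-v1 and using the transition table `env.unwrapped.P` both to enumerate successors and to step; the hyperparameters and episode counts are arbitrary choices, not tuned values:

```python
import random
import gym

# Tabular sketch of the V-only update on non-slippery FrozenLake
# (deterministic transitions, enumerable successors).
env = gym.make("FrozenLake-v1", is_slippery=False)
P = env.unwrapped.P                      # P[s][a] = [(prob, s', reward, done)]
n_states, n_actions = env.observation_space.n, env.action_space.n

GAMMA, ALPHA, EPSILON = 0.95, 0.5, 0.2
V = [0.0] * n_states

def lookahead(s):
    """(action, s', R(s,s') + gamma*V(s')) for every action at s, using the
    transition table to enumerate the deterministic successors."""
    out = []
    for a, transitions in P[s].items():
        _, s2, r, done = transitions[0]  # deterministic: exactly one entry
        out.append((a, s2, r + (0.0 if done else GAMMA * V[s2])))
    return out

for episode in range(2000):
    s = 0                                # FrozenLake always starts in state 0
    for _ in range(200):                 # step cap to keep episodes finite
        # epsilon-greedy behaviour, greedy w.r.t. the one-step lookahead
        if random.random() < EPSILON:
            a = random.randrange(n_actions)
        else:
            a = max(lookahead(s), key=lambda t: t[2])[0]
        _, s2, r, done = P[s][a][0]      # step via the deterministic model

        # V-only Q-learning update:
        #   V(s) <- V(s) + alpha * [max_{s'}(R(s,s') + gamma V(s')) - V(s)]
        # Note the target uses the max, not the action actually taken.
        target = max(t[2] for t in lookahead(s))
        V[s] += ALPHA * (target - V[s])
        s = s2
        if done:
            break

greedy = [max(lookahead(s), key=lambda t: t[2])[0] for s in range(n_states)]
print("V:", [round(v, 2) for v in V])
print("greedy actions (0=Left, 1=Down, 2=Right, 3=Up):", greedy)
```

If everything works as intended, the printed greedy actions should steer around the holes and toward the goal, using only the single table $V$.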
What do you think?
UPDATE
Here is the code showing that an optimal policy in Q-learning can be found with just $V(s)$.