In reinforcement learning, the state-action value function seems to be used more than the state value function. Why is it so?
1 Answer
We are ultimately interested in finding an optimal policy, that is, a rule that picks the best action in each state on the way to the final goal. State values on their own don't provide that: they tell you the expected return from a given state onward, but they don't tell you which action to take. To derive the best action in a state from state values alone, you would have to look one step ahead at every possible action, which requires a model of the environment (transition probabilities and rewards), and then pick the action with the highest expected immediate reward plus discounted value of the resulting state. That is often inconvenient or impossible, for example when the model is unknown. State-action values attach the expected return to each (state, action) pair rather than to the state alone, so you don't need to simulate all actions one step ahead and see where you end up. You only need to pick the action with the highest value, because that is by definition the best action to take in that state.
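To make the difference concrete, here is a minimal sketch in Python. The three-state MDP, the arrays `P` and `R`, the discount `gamma`, and the helper names are all made up for illustration; the point is only that `greedy_from_v` needs the model (`P`, `R`) to choose an action, while `greedy_from_q` needs nothing but the table itself.

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration only):
# 3 states, 2 actions, state 2 is absorbing with zero reward.
gamma = 0.9
# P[s, a, s'] = probability of landing in s' after taking a in s
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]],  # state 0
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # state 2 (absorbing)
])
# R[s, a] = expected immediate reward for taking a in s
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])

def value_iteration(P, R, gamma, iters=200):
    """Compute optimal V and Q for a known model by value iteration."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    Q = R + gamma * P @ V  # Q[s, a] consistent with the final V
    return V, Q

def greedy_from_q(Q, s):
    """With Q, acting greedily is a table lookup: no model needed."""
    return int(np.argmax(Q[s]))

def greedy_from_v(V, s, P, R, gamma):
    """With V only, a one-step lookahead through the model is required."""
    lookahead = R[s] + gamma * P[s] @ V  # one entry per action
    return int(np.argmax(lookahead))

V, Q = value_iteration(P, R, gamma)
for s in (0, 1):
    print(s, greedy_from_q(Q, s), greedy_from_v(V, s, P, R, gamma))
```

Both functions agree on the action here, but only because `P` and `R` were available for the lookahead. In model-free settings such as Q-learning there is no `P` or `R` to pass in, which is one reason the state-action value function is the more common choice.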