I'm learning about reinforcement learning, particularly policy gradient and actor-critic methods. I've noticed that many algorithms use stochastic policies during training (i.e., they sample actions from a probability distribution).
I don't understand how the value function for a stochastic policy can be estimated accurately. The value function $V^{\pi}(s)$ is the expected return when starting in state $s$ and following policy $\pi$, but if we are choosing actions randomly, then it seems to me that we are not really following any particular policy.
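For concreteness, here is how I currently picture the estimation working, as a minimal Python sketch with a made-up two-state MDP (the environment dynamics, policy probabilities, and horizon are all invented just for illustration):

```python
import random

# A tiny made-up 2-state MDP, just to illustrate what I mean (not from any library).
GAMMA = 0.9
HORIZON = 10

def step(state, action):
    """Hypothetical dynamics: action 1 tends to move to state 1, which pays reward 1."""
    if action == 1:
        next_state = 1 if random.random() < 0.8 else 0
    else:
        next_state = 0 if random.random() < 0.8 else 1
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """A stochastic policy pi(a|s): the action is sampled from a probability distribution."""
    p_action1 = 0.7 if state == 1 else 0.3
    return 1 if random.random() < p_action1 else 0

def rollout_return(start_state):
    """Run one episode from start_state, sampling actions from pi, and return the discounted return."""
    state, g, discount = start_state, 0.0, 1.0
    for _ in range(HORIZON):
        action = policy(state)              # the action is random, not deterministic
        state, reward = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g

# My picture of a Monte Carlo estimate of V^pi(s): average the (random) returns
# over many sampled episodes, even though every episode takes different actions.
n_episodes = 5000
estimate = sum(rollout_return(0) for _ in range(n_episodes)) / n_episodes
print(f"Monte Carlo estimate of V^pi(s=0): {estimate:.3f}")
```

Each individual episode here takes a different sequence of actions, which is exactly what confuses me.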
How are we getting an accurate estimate of the value of policy $\pi$ if we are not following a deterministic policy during training?