I'm learning about reinforcement learning, particularly policy gradient and actor-critic methods. I've noticed that many algorithms use stochastic policies during training (i.e., they sample actions from a probability distribution).
I don't understand how the value function for a stochastic policy can be estimated accurately. The value function $V^{\pi}(s)$ is the expected return when starting in state $s$ and following policy $\pi$, but if we are choosing actions randomly, then it seems to me that we are not really following any particular policy.
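For concreteness, here is how I currently picture the estimation working, as a minimal Python sketch with a made-up two-state MDP (the environment dynamics, policy probabilities, and horizon are all invented just for illustration):

```python
import random

# A tiny made-up 2-state MDP, just to illustrate what I mean (not from any library).
GAMMA = 0.9
HORIZON = 10

def step(state, action):
    """Hypothetical dynamics: action 1 tends to move to state 1, which pays reward 1."""
    if action == 1:
        next_state = 1 if random.random() < 0.8 else 0
    else:
        next_state = 0 if random.random() < 0.8 else 1
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """A stochastic policy pi(a|s): the action is sampled from a probability distribution."""
    p_action1 = 0.7 if state == 1 else 0.3
    return 1 if random.random() < p_action1 else 0

def rollout_return(start_state):
    """Run one episode from start_state, sampling actions from pi, and return the discounted return."""
    state, g, discount = start_state, 0.0, 1.0
    for _ in range(HORIZON):
        action = policy(state)              # the action is random, not deterministic
        state, reward = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g

# My picture of a Monte Carlo estimate of V^pi(s): average the (random) returns
# over many sampled episodes, even though every episode takes different actions.
n_episodes = 5000
estimate = sum(rollout_return(0) for _ in range(n_episodes)) / n_episodes
print(f"Monte Carlo estimate of V^pi(s=0): {estimate:.3f}")
```

Each individual episode here takes a different sequence of actions, which is exactly what confuses me.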
How are we getting an accurate estimate of the value of policy $\pi$ if we are not following a deterministic policy during training?