I was watching a lecture on policy gradients and Bellman equations. The lecturer said that a Bellman equation indirectly creates a policy, while policy gradient methods directly learn a policy. Why is this?
1 Answer
Policy gradient methods directly learn the parameters of a policy function, which is a mapping from states to actions. For example, $p(s, a)$ can denote a function that takes a state $s$ and an action $a$ as input, and returns the probability of taking action $a$ in state $s$ as output (equivalently, it could take only $s$ as input and output a vector or distribution of probabilities over all actions). Such a function will usually have parameters (for example, a vector of weights if it is a linear function of the state, or matrices of weights in the case of a neural network). Policy gradient methods learn the policy directly by adjusting those parameters, typically by following the gradient of the expected return.
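To make this concrete, here is a minimal sketch, not taken from the lecture, of what "learning the parameters of a policy" can look like: a softmax policy that is linear in state features, with a REINFORCE-style parameter update. All names (`n_features`, `n_actions`, `policy_gradient_update`, ...) are illustrative assumptions, not a reference implementation.

```python
import numpy as np

# Illustrative sizes; in practice these come from the environment.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))  # the learnable policy parameters

def action_probabilities(state_features):
    """Return a probability distribution over all actions for one state."""
    preferences = state_features @ theta   # one score per action
    preferences -= preferences.max()       # for numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()

def policy_gradient_update(state_features, action, return_, learning_rate=0.01):
    """REINFORCE-style update: make actions that led to high returns more probable.

    Uses the gradient of log pi(a|s) w.r.t. theta, scaled by the observed return.
    """
    global theta
    probs = action_probabilities(state_features)
    grad_log_pi = -np.outer(state_features, probs)  # d log pi(a|s) / d theta
    grad_log_pi[:, action] += state_features
    theta += learning_rate * return_ * grad_log_pi

# The policy is used directly: sample an action from its output distribution.
s = rng.normal(size=n_features)
a = rng.choice(n_actions, p=action_probabilities(s))
```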
The Bellman equation is typically used to learn a function that computes value estimates, such as $Q(s, a)$ values, which can intuitively be understood as the sum of future rewards we expect to collect if we take action $a$ in state $s$ (and afterwards follow a certain policy, typically the optimal policy, from then on). Such a learned $Q(s, a)$ function is not a directly learned policy, because it is not a direct mapping from states to actions; it does not directly tell us which action to take.
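For reference, one common way to write the Bellman optimality equation for such $Q$ values (with discount factor $\gamma$, and upper-case letters for random variables) is

$$Q^*(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \;\middle|\; S_t = s, A_t = a \right],$$

i.e. the value of taking $a$ in $s$ is the expected immediate reward plus the discounted value of acting greedily from the next state onwards. Value-based methods learn $Q$ by repeatedly pushing their estimates towards this right-hand side.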
It can still be used indirectly to create a policy. For example, once such a $Q(s, a)$ function is learned, the greedy policy can be easily implemented by looping over all possible actions, and taking the action that gives the greatest $Q(s, a)$ output. This is indirect because the learned function does not directly give you the actions, but only provides information which you can convert into actions with some extra processing.
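As a small sketch of that extra processing (Python, with a hypothetical `q_values` dictionary standing in for the learned $Q(s, a)$ function), the greedy policy is just an argmax over actions:

```python
def greedy_action(state, q_values, actions):
    """Turn a learned Q-function into a policy by picking the highest-valued action."""
    return max(actions, key=lambda a: q_values[(state, a)])

# Hypothetical Q-values for a single state "s0".
actions = ["left", "right", "stay"]
q_values = {("s0", "left"): 0.1, ("s0", "right"): 0.7, ("s0", "stay"): 0.4}
print(greedy_action("s0", q_values, actions))  # -> "right"
```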