I was watching a lecture on policy gradients and Bellman equations. The lecturer said that a Bellman equation indirectly creates a policy, while policy gradient methods directly learn a policy. Why is this?
1 Answer
Policy gradient methods directly learn the parameters of a policy function, which is a mapping from states to actions. For example, $p(s, a)$ can denote a function that takes a state $s$ and an action $a$ as input, and returns the probability of taking action $a$ in state $s$ as output (equivalently, it could take only $s$ as input and output a vector or distribution of probabilities over all actions). Such a function will usually have parameters (for example, a vector of weights if it is a linear function of the state, or matrices of weights in the case of a neural network). Policy gradient methods learn the policy directly by adjusting those parameters, typically by following the gradient of the expected return.
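To make this concrete, here is a minimal sketch, not taken from the lecture, of what "learning the parameters of a policy" can look like: a softmax policy that is linear in state features, with a REINFORCE-style parameter update. All names (`n_features`, `n_actions`, `policy_gradient_update`, ...) are illustrative assumptions, not a reference implementation.

```python
import numpy as np

# Illustrative sizes; in practice these come from the environment.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))  # the learnable policy parameters

def action_probabilities(state_features):
    """Return a probability distribution over all actions for one state."""
    preferences = state_features @ theta   # one score per action
    preferences -= preferences.max()       # for numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()

def policy_gradient_update(state_features, action, return_, learning_rate=0.01):
    """REINFORCE-style update: make actions that led to high returns more probable.

    Uses the gradient of log pi(a|s) w.r.t. theta, scaled by the observed return.
    """
    global theta
    probs = action_probabilities(state_features)
    grad_log_pi = -np.outer(state_features, probs)  # d log pi(a|s) / d theta
    grad_log_pi[:, action] += state_features
    theta += learning_rate * return_ * grad_log_pi

# The policy is used directly: sample an action from its output distribution.
s = rng.normal(size=n_features)
a = rng.choice(n_actions, p=action_probabilities(s))
```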
The Bellman equation is typically used to learn a function that computes value estimates, such as $Q(s, a)$ values, which can intuitively be understood as the sum of future rewards we expect to collect if we take action $a$ in state $s$ (and afterwards follow a certain policy, typically the optimal policy, from then on). Such a learned $Q(s, a)$ function is not a directly learned policy, because it is not a direct mapping from states to actions; it does not directly tell us which action to take.
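For reference, one common way to write the Bellman optimality equation for such $Q$ values (with discount factor $\gamma$, and upper-case letters for random variables) is

$$Q^*(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \;\middle|\; S_t = s, A_t = a \right],$$

i.e. the value of taking $a$ in $s$ is the expected immediate reward plus the discounted value of acting greedily from the next state onwards. Value-based methods learn $Q$ by repeatedly pushing their estimates towards this right-hand side.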
It can still be used indirectly to create a policy. For example, once such a $Q(s, a)$ function is learned, the greedy policy can be easily implemented by looping over all possible actions, and taking the action that gives the greatest $Q(s, a)$ output. This is indirect because the learned function does not directly give you the actions, but only provides information which you can convert into actions with some extra processing.
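As a small sketch of that extra processing (Python, with a hypothetical `q_values` dictionary standing in for the learned $Q(s, a)$ function), the greedy policy is just an argmax over actions:

```python
def greedy_action(state, q_values, actions):
    """Turn a learned Q-function into a policy by picking the highest-valued action."""
    return max(actions, key=lambda a: q_values[(state, a)])

# Hypothetical Q-values for a single state "s0".
actions = ["left", "right", "stay"]
q_values = {("s0", "left"): 0.1, ("s0", "right"): 0.7, ("s0", "stay"): 0.4}
print(greedy_action("s0", q_values, actions))  # -> "right"
```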