I was reading the book and saw the formula to optimize $\theta$: $$ \theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta) \\ \nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T R_t(\tau)\nabla_\theta\log\pi_\theta(a_t|s_t)\right] $$ Does this mean that the policy is a function? Does it have to be one to calculate the gradient? If so, how do I identify that policy function? I still don't get the math. I searched and saw that people usually use the softmax function, but its argument is something called a scoring function. How do I identify the scoring function?

1 Answer

The policy $π_θ(a∣s)$ is a function that maps a state $s$ to a probability distribution over actions $a$. It is parameterized by $θ$, which is optimized to maximize the expected reward. The gradient $\nabla_\theta J(\pi_\theta)$ tells us how to adjust $θ$ to improve the policy; to compute it, the policy must be differentiable with respect to $θ$. The policy is commonly assumed to be a softmax function ${\displaystyle {\frac {e^{f(s,a;θ)}}{\sum _{a'}e^{f(s,a';θ)}}}\,}$ for discrete actions and a Gaussian for continuous actions.
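To make this concrete, here is a minimal sketch of a softmax policy over discrete actions, assuming (as an illustration, not the only choice) a linear scoring function $f(s,a;θ) = θ_a \cdot s$ where each action has its own weight vector:

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(theta, s):
    """pi_theta(a|s): softmax over linear scores f(s,a;theta) = theta[a] . s."""
    scores = theta @ s     # one real-valued score (logit) per action
    return softmax(scores)

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))  # hypothetical sizes: 3 actions, 4 state features
s = rng.normal(size=4)

probs = policy(theta, s)
print(probs)               # a valid probability distribution over the 3 actions
```

Whatever form $f$ takes, the softmax guarantees the outputs are positive and sum to 1, so the policy is a proper conditional distribution over actions for every state.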

A common scoring function $f(s,a;θ)$ takes the current state $s$ and a possible action $a$ and outputs a real-valued score, also known as a logit. It is learned during training and determines which actions are more likely under the policy. If you use a neural network as your parameterized policy over discrete actions, then the scoring function's outputs are simply the raw outputs of the last layer, before the softmax activation is applied, and the gradient above naturally involves the scoring function.
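For the softmax policy with linear scores sketched above, $\nabla_\theta \log \pi_\theta(a|s)$ even has a closed form: the component for action $a'$ is $(\mathbb{1}\{a'=a\} - \pi_\theta(a'|s))\,s$. A sketch that verifies this against a finite-difference check (sizes are again arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_pi(theta, s, a):
    return np.log(softmax(theta @ s))[a]

def grad_log_pi(theta, s, a):
    """Analytic gradient of log pi(a|s) for a softmax policy with linear
    scores: d/d theta[a'] = (1{a'=a} - pi(a'|s)) * s."""
    probs = softmax(theta @ s)
    onehot = np.zeros(theta.shape[0])
    onehot[a] = 1.0
    return np.outer(onehot - probs, s)

rng = np.random.default_rng(1)
theta = rng.normal(size=(3, 4))
s = rng.normal(size=4)
a = 2

g = grad_log_pi(theta, s, a)

# central finite differences on each entry of theta
eps = 1e-6
num = np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        tp = theta.copy(); tp[i, j] += eps
        tm = theta.copy(); tm[i, j] -= eps
        num[i, j] = (log_pi(tp, s, a) - log_pi(tm, s, a)) / (2 * eps)

print(np.max(np.abs(g - num)))   # should be near zero
```

With a neural-network scoring function you would not derive this by hand; an autodiff framework computes $\nabla_\theta \log \pi_\theta(a|s)$ through the logits for you, which is exactly why the policy must be differentiable in $θ$.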

Logits are the raw outputs of the last layer in a neural network before applying the activation function, typically in the context of classification problems. They represent the unnormalized log probabilities of different classes... softmax is used to convert logits into probabilities
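Putting the pieces together, one gradient-ascent step from the question's update rule, $\theta \leftarrow \theta + \alpha R \nabla_\theta \log \pi_\theta(a|s)$, applied to a single sampled action with a positive return, should make that action more probable. A sketch under the same linear-softmax assumptions as before:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # gradient of log pi(a|s) for a softmax policy with linear scores
    probs = softmax(theta @ s)
    onehot = np.eye(theta.shape[0])[a]
    return np.outer(onehot - probs, s)

rng = np.random.default_rng(2)
theta = rng.normal(size=(3, 4))
s = rng.normal(size=4)
a, R, alpha = 1, 5.0, 0.1   # hypothetical sampled action, return, step size

before = softmax(theta @ s)[a]
theta = theta + alpha * R * grad_log_pi(theta, s, a)
after = softmax(theta @ s)[a]
print(before, after)        # the rewarded action becomes more probable
```

In a full REINFORCE loop you would average this update over whole trajectories sampled from $\pi_\theta$, weighting each step's $\nabla_\theta \log \pi_\theta(a_t|s_t)$ by its return $R_t(\tau)$ as in the formula.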

cinch