
David Silver argues, in his Reinforcement Learning course, that policy-based reinforcement learning (RL) is more effective than value-based RL in high-dimensional action spaces. He points out that the implicit policy (e.g., $\epsilon$-greedy) in Q-learning or Sarsa must find a maximum of $Q_\pi$ over the action space at each time step, which may be infeasible when there are many actions.

However, from my understanding, policy gradient methods that use a neural network to represent the policy choose their actions from the output of a softmax:

$$ \pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}, $$

where the denominator normalizes over all possible actions $b \in \mathcal{A}$ and $h(s, a, \theta)$ is a numerical preference (notation from the Reinforcement Learning book by Sutton and Barto), usually represented by a neural network.

By using a softmax in action preferences, don't we compute $\pi(a|s, \theta)$ for all actions, ending up with the same computational complexity as in value-based methods?
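
To make my point concrete, here is a small numpy sketch (the shapes and the linear models are just my own illustration): both the softmax policy and a greedy policy over $Q$ seem to compute one score per action for the current state.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
s = rng.normal(size=n_features)                  # feature vector of the current state

# Policy-gradient style: preferences h(s, a, theta) for ALL actions, then a softmax.
theta = rng.normal(size=(n_features, n_actions))
h = s @ theta                                    # one preference per action
pi = np.exp(h - h.max()) / np.exp(h - h.max()).sum()
a_pg = rng.choice(n_actions, p=pi)               # sample an action from pi(.|s)

# Value-based style: Q(s, a, w) for ALL actions, then an (epsilon-)greedy max.
w = rng.normal(size=(n_features, n_actions))
q = s @ w                                        # one value estimate per action
a_vb = int(np.argmax(q))                         # greedy action

print(pi, a_pg, q, a_vb)                         # both paths touched every action once
```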

Saucy Goat

1 Answer


The softmax in action preferences above is used by policy gradient methods in (possibly large) discrete action spaces. For continuous spaces with an infinite number of actions, a Gaussian distribution is often used instead, so you never need to compute a potentially large number of numerical preferences with a neural network or any other model.
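
As a minimal sketch of the continuous case (the linear mean and the fixed log-std below are just illustrative modelling choices), a Gaussian policy outputs a mean and a standard deviation and samples from them, so no per-action preference is ever enumerated:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, action_dim = 8, 2
s = rng.normal(size=n_features)                   # feature vector of the current state

W_mu = rng.normal(size=(n_features, action_dim))  # parameters of the mean
log_std = np.zeros(action_dim)                    # state-independent log std (one common choice)

mu = s @ W_mu                                     # mean of the Gaussian policy
a = rng.normal(mu, np.exp(log_std))               # sample a continuous action directly

print(mu, a)                                      # no loop or max over an (infinite) action set
```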

In the softmax case above, you can sometimes simply fit the preferences as a linear function of a feature vector $x(s,a)$, which is often much simpler than computing action values via Monte Carlo, TD, or DP, all of which have much higher time complexity. In these cases policy gradient methods are indeed more effective learners, since the policy is often much easier to approximate than the action-value function, and certain types of domain knowledge can be incorporated directly into the policy, as explained in the paper Why Most Decisions Are Easy in Tetris (a small sketch of the linear case follows the quote below):

most of the problems are easy in the following sense: One can choose well among the available actions without knowing an evaluation function that scores well in the game. This is a consequence of three conditions that are prevalent in the game: simple dominance, cumulative dominance, and noncompensation. These conditions can be exploited to develop faster and more effective learning algorithms. In addition, they allow certain types of domain knowledge to be incorporated with ease into a learning algorithm
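
Here is the small sketch of the linear-preference case promised above, with a REINFORCE-style update (the toy features, the return $G$ and the step size are placeholders picked for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, n_features = 5, 3, 6
features = rng.normal(size=(n_states, n_actions, n_features))  # toy x(s, a) table
theta = np.zeros(n_features)                                   # linear preference weights

def softmax_policy(s, theta):
    h = features[s] @ theta                  # linear preferences h(s, a) = theta . x(s, a)
    e = np.exp(h - h.max())
    return e / e.sum()

# One REINFORCE-style update from a sampled (state, action, return) triple.
s, G, alpha = 0, 1.0, 0.1
pi = softmax_policy(s, theta)
a = rng.choice(n_actions, p=pi)
# grad log pi(a|s, theta) for a linear softmax: x(s, a) - sum_b pi(b|s) x(s, b)
grad_log_pi = features[s, a] - pi @ features[s]
theta += alpha * G * grad_log_pi

print(pi, a, theta)
```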

Finally, if your softmax has to employ action values to compute the preferences, as in most actor-critic methods, then yes, it has the same per-time-step computational complexity as the corresponding value-based method, and that cost cannot be avoided. But the softmax additionally lets the agent explore by sampling actions from the resulting distribution rather than always selecting the action with the highest estimated value, so it can find stochastic optimal policies in problems with heavy function approximation, which "deterministic" value-based methods cannot.
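
As a rough sketch of that last point (the action values and the temperature are numbers made up for illustration), sampling from the softmax keeps trying several actions, while the greedy rule commits to one:

```python
import numpy as np

rng = np.random.default_rng(3)
q = np.array([1.0, 0.9, 0.2])        # critic's action-value estimates for one state
tau = 0.5                            # softmax temperature (illustrative choice)

pref = q / tau
pi = np.exp(pref - pref.max()) / np.exp(pref - pref.max()).sum()

greedy = [int(np.argmax(q)) for _ in range(5)]               # always the same action
sampled = [int(rng.choice(len(q), p=pi)) for _ in range(5)]  # a stochastic mixture

print(pi)        # probabilities concentrated on, but not exclusive to, the best action
print(greedy)    # deterministic choice, no exploration
print(sampled)   # exploration comes for free from the sampling
```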

cinch