I don't understand the difference between a policy and rewards. Sure, a policy tells us what to do, but isn't the output of a neural network trained on rewards basically a policy (i.e. choose the maximum reward)? What is different about the policy? An extra softmax applied?
2 Answers
A (stochastic) policy is a set of conditional probability distributions, $$\pi(a \mid S=s), \forall s \in \mathcal{S}.$$ If the policy is deterministic, then it is a function $$\pi: \mathcal{S} \rightarrow \mathcal{A},$$ so $\pi(s) = a$ is the action that the policy $\pi$ returns in the state $s$ - it always produces the same action for a given state, unless it's a non-stationary policy. A policy is also called a strategy (in game theory). To be usable, a stochastic policy must be turned into a decision rule, i.e. you need to sample from it. A stochastic policy generalises a deterministic one.
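To make this concrete, here is a minimal sketch in Python (the states, actions and probabilities are made up) of a stochastic policy stored as one distribution per state, of how sampling turns it into a decision rule, and of the deterministic special case:

```python
import random

# Toy example: states and actions are just labels (made up for illustration).
states = ["s0", "s1"]
actions = ["left", "right"]

# A stochastic policy: one conditional distribution pi(a | s) per state.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Turn the stochastic policy into a decision rule by sampling from pi(. | s)."""
    dist = policy[state]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

# A deterministic policy: a plain mapping from states to actions.
# It is the special case where pi(a | s) puts probability 1 on a single action.
deterministic_policy = {"s0": "left", "s1": "right"}

print(sample_action(stochastic_policy, "s0"))  # e.g. 'left' (with probability 0.8)
print(deterministic_policy["s1"])              # always 'right'
```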
The rewards are the outputs of the reward function. A reward function can be deterministic (which is often the case) or stochastic. It can be defined as $$R : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{R},$$ where $\mathcal{R} \subseteq \mathbb{R}$ is the reward space. If it's stochastic, then
\begin{align} R(s, a) &= \mathbb{E}\left[ R_t \mid S_t = s, A_t = a\right] \\ &=\sum_r r \, p(r \mid S_t = s, A_t=a), \end{align} where $R_t$ is the random variable that represents the reward at time step $t$ and assuming a finite MDP. Stochastic reward functions generalise deterministic ones.
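As a small illustration, here is a sketch (with a made-up finite distribution $p(r \mid s, a)$ for a single state-action pair) of computing the expected reward as the sum above:

```python
# Hypothetical distribution p(r | s, a) for one fixed (s, a): reward value -> probability.
reward_distribution = {
    0.0: 0.5,
    1.0: 0.3,
    10.0: 0.2,
}

# R(s, a) = sum_r r * p(r | s, a)
expected_reward = sum(r * p for r, p in reward_distribution.items())
print(expected_reward)  # 2.3

# A deterministic reward function is the special case where one reward value has probability 1.
```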
So, policies are probability distributions or functions, while rewards are numbers. Their definitions are therefore different, even though they are related.
How are they related? In different ways. The most important one is that an optimal policy for a given MDP is the one that, if followed, maximises the expected return, which is a function of the rewards (typically, a discounted sum of rewards). The definition of an optimal policy makes more sense if you also know the definition of a value function - I recommend you pick up Sutton & Barto's book and read the relevant sections.
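For instance, a sketch of the discounted sum of rewards for a single episode (the rewards and discount factor below are made up) looks like this:

```python
# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected while following some policy (made-up numbers).
episode_rewards = [0.0, 0.0, 1.0, 5.0]
print(discounted_return(episode_rewards))  # 0 + 0 + 0.9**2 * 1 + 0.9**3 * 5 = 4.455
```

An optimal policy is one whose expected value of this quantity, from every state, is at least as high as that of any other policy.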
Typically, the answer to a control problem in reinforcement learning (RL) is "What is the policy that maximises total reward?".
In really simple scenarios that you might study to understand RL basics, this can be so obvious that you could just search ahead and discover the correct action without really using RL.
However, sticking to the formal definitions in RL allows you to tackle harder problems, where it is not obvious how to decide what to do, or maybe even how to access the best rewards.
In the formalism of RL:
Reward is a real-valued signal received after each time step. For RL theory to be useful, the distribution of reward values should be the same whenever the start state and action are the same (you may also make the distribution depend on the next state, but that doesn't change the rule, it just means reward and next state will be correlated).
A policy is a function for action choice: it takes the state as input and returns a distribution over all possible actions, $\pi(a \mid s) = \Pr\{A_t=a \mid S_t=s\}$, i.e. a map $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$. It may also be deterministic and directly return a chosen action, $\pi: \mathcal{S} \rightarrow \mathcal{A}$.
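To relate this to the question's "extra softmax": if a network produces one raw score per action for the current state, a softmax over those scores does give a valid stochastic policy $\pi(a \mid s)$, but the scores are typically learned action preferences or value estimates, not rewards. A minimal sketch with made-up numbers (NumPy assumed):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = np.exp(x - np.max(x))
    return z / z.sum()

action_scores = np.array([2.0, 0.5, -1.0])  # e.g. a network's output for state s (made up)
pi_given_s = softmax(action_scores)         # pi(a | s), sums to 1
action = np.random.choice(len(pi_given_s), p=pi_given_s)

print(pi_given_s, action)
```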
These are clearly very different things. Often we are interested in finding a policy that maximises an aggregate of rewards, but that doesn't mean the policy is some simple function of the rewards. Think of a chess game - which piece should you move to win? There is no immediate reward, and the consequences of a good or bad move will only appear much later. The relationship between the policy in an early game state and the reward at the end (e.g. +1 for a win, -1 for a loss) is not at all clear.
Only in the very simplest of scenarios, where taking an action directly leads to an already-predictable reward, could you use your idea of $\pi(s) = \operatorname{argmax}_a r(s, a)$, where $r(s,a)$ is a known reward function. For a start, this will not help you look ahead more than one time step. What if the best reward now is followed by a really bad reward?
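A made-up two-step example of that pitfall: the action with the best immediate reward can lead to a worse total reward.

```python
# Hypothetical immediate and next-step rewards for two actions in some state.
immediate_reward = {"a1": 1.0, "a2": 0.0}   # reward received now
future_reward = {"a1": -10.0, "a2": 5.0}    # reward received on the next step

greedy = max(immediate_reward, key=immediate_reward.get)  # argmax over immediate reward: 'a1'
best_overall = max(immediate_reward,
                   key=lambda a: immediate_reward[a] + future_reward[a])  # 'a2'

print(greedy, best_overall)  # a1 a2 -- greedy on immediate reward is not optimal here
```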
RL provides the toolkit for deciding how to trade off immediate versus future gains, and for learning what to do when you don't already have a simple function $r(s,a)$ that tells you in advance what is going to happen.