
In the average-reward setting, the quality of a policy $\pi$ is defined as: $$ r(\pi) = \lim_{h\to\infty}\frac{1}{h} \sum_{j=1}^{h}E[R_j] $$ Once the steady-state distribution has been reached, we can write the above equation as: $$ r(\pi) = \lim_{t\to\infty}E[R_t \mid A_{0:t-1} \sim \pi] $$ We can estimate $r(\pi)$ with an incremental (running-average) update: $$ \bar R_t = \bar R_{t-1} + \beta (R_t - \bar R_{t-1}), $$ which reduces to the sample average $\frac{1}{t} \sum_{j=1}^{t} R_j$ when $\beta = 1/t$. Here $ \bar R_{t-1}$ is the estimate of the average reward $r(\pi)$ at time step $t-1$. We use this incremental update rule in the SARSA algorithm:

[pseudocode for differential SARSA; image not reproduced]
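For concreteness, here is a minimal tabular sketch of how this incremental average-reward estimate is typically combined with SARSA (as in differential SARSA). The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward)`) and the parameter names are assumptions for illustration, not part of any particular library.

```python
import numpy as np

def differential_sarsa(env, n_states, n_actions,
                       alpha=0.1, beta=0.01, epsilon=0.1, steps=100_000):
    """Minimal tabular sketch of differential (average-reward) SARSA,
    assuming a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward)."""
    Q = np.zeros((n_states, n_actions))
    avg_reward = 0.0                      # running estimate of r(pi), i.e. R-bar

    def policy(s):
        # epsilon-greedy with respect to the current Q
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    s = env.reset()
    a = policy(s)
    for _ in range(steps):
        s_next, r = env.step(a)
        a_next = policy(s_next)
        # differential TD error: reward measured relative to the average reward
        delta = r - avg_reward + Q[s_next, a_next] - Q[s, a]
        # incremental update of the average-reward estimate
        # (beta * (r - avg_reward) is a simpler variant of the same idea)
        avg_reward += beta * delta
        Q[s, a] += alpha * delta
        s, a = s_next, a_next
    return Q, avg_reward
```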

Now, in the algorithm above, we can see that the policy changes over time. But to estimate $r(\pi)$, the agent should follow a fixed policy $\pi$ for a long period of time. How, then, can we use $r(\pi)$ if the policy changes over time?


1 Answer


You are correct: to evaluate a policy, we need to fix it.

  • We can temporarily fix it, just to evaluate it over a number of test cases. For a fair comparison, we should fix the start states and the random seeds used for the transitions (see the sketch after this list).
  • We can wait until convergence / until we are satisfied. The resulting policy would be what we implement in the "true", trained agent. This is important when exploration might be harmful in the "real world" domain where the agent will be operating.
  • We can also measure the average reward of the "non-stationary" policy and assume that, once the agent is doing well, this is close enough to evaluating the fixed policy. This is not ideal, but it is trivial to implement and is often used to track learning progress. If you have a life-long learning agent, this may be the best you can do.
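To make the first option concrete, here is a minimal sketch of evaluating a temporarily frozen (greedy) policy over a fixed set of seeded test episodes. The environment interface (`env.reset(seed=...)` returning a state, `env.step(a)` returning `(next_state, reward)`) and all parameter names are assumptions for illustration, not part of any particular library.

```python
import numpy as np

def evaluate_frozen_policy(env, Q, episodes=20, horizon=1000, base_seed=0):
    """Estimate the average reward per step of the greedy policy derived
    from Q, keeping that policy fixed (no learning, no exploration)."""
    avg_rewards = []
    for ep in range(episodes):
        # Fixed seeds -> same start states and transition noise across runs,
        # so different snapshots of Q can be compared fairly.
        s = env.reset(seed=base_seed + ep)
        total = 0.0
        for _ in range(horizon):
            a = int(np.argmax(Q[s]))   # greedy action from the frozen policy
            s, r = env.step(a)
            total += r
        avg_rewards.append(total / horizon)
    return float(np.mean(avg_rewards))
```

Calling this periodically during training (on a copy of the current `Q`) gives an evaluation curve for the fixed greedy policy, separate from the average reward accumulated by the changing behaviour policy.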