
I am new to reinforcement learning, but, for a finite-horizon application problem, I am considering using the average reward instead of the sum of rewards as the objective. Specifically, there are at most $T$ possible time steps (e.g., the usage rate of an app in each time step), and in each time step the reward is either 0 or 1. The goal is to maximize the daily average usage rate.

The episode length $T$ is at most 10; it is the maximum time window over which the product can observe a user's behavior in the chosen data. The data contain an indicator of whether an episode has terminated. This is offline learning, so $T$ is given in the data for each episode. As long as an episode has not terminated, there is a reward in $\{0, 1\}$ at each time step.
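For concreteness, this is roughly how I picture one logged episode in the offline data; the field names are placeholders I made up, not the actual schema:

```python
# A hypothetical logged episode (T is at most 10); field names are placeholders, not my real schema.
episode = {
    "rewards": [1, 0, 1, 1, 0, 1],  # one {0, 1} reward per observed time step
    "terminated": True,             # indicator in the data that the episode ended here
}
T = len(episode["rewards"])         # effective horizon for this episode, given by the data
```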

I have heard that if I use the average reward for a finite horizon, the optimal policy is no longer a stationary policy and the optimal $Q$-function depends on time. I am wondering why this is the case.
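To be explicit about what I mean by a time-dependent optimal $Q$-function, I am picturing a separate $Q^*_t$ for every time step, with the usual finite-horizon backup (my own rendering of the standard recursion, so please correct me if this is not the right object):

$$Q^*_t(s, a) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^*_{t+1}(s', a') \,\middle|\, s_t = s,\ a_t = a \right], \qquad Q^*_T(s, a) = 0,$$

with the corresponding (non-stationary) greedy policy $\pi^*_t(s) = \arg\max_a Q^*_t(s, a)$.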

I see that, normally, the objective is defined as maximizing

$$\sum_{t=0}^{T} \gamma^t r_t$$

I am considering two types of average-reward definitions (a small sketch computing both is given after the list):

  1. $\frac{1}{T}\sum_{t=0}^{T}\gamma^t r_t$, where $T$ varies in each episode.

  2. $\frac{1}{T-t}\sum_{i=t-1}^{T}\gamma^i r_i$
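To make sure I am comparing the right quantities, here is a minimal Python sketch (function names and the value of $\gamma$ are mine, purely for illustration) of how I would compute the two definitions for a single logged episode. In the second function I start the inner sum at $i = t$ rather than $i = t-1$ to keep the list indices in range; whether that is the right convention is part of what I am unsure about.

```python
def avg_reward_def1(rewards, gamma=0.99):
    # Definition 1: discounted sum of the T observed rewards, divided by T.
    T = len(rewards)
    return sum(gamma**t * r for t, r in enumerate(rewards)) / T

def avg_reward_def2(rewards, t, gamma=0.99):
    # Definition 2: discounted sum of the remaining rewards, divided by T - t.
    # Assumes 0 <= t < T; the exact off-by-one convention is part of my question.
    T = len(rewards)
    return sum(gamma**i * rewards[i] for i in range(t, T)) / (T - t)

# Example on a hypothetical episode of length 6 (rewards are the {0, 1} usage indicators):
episode_rewards = [1, 0, 1, 1, 0, 1]
print(avg_reward_def1(episode_rewards))        # average over the whole episode
print(avg_reward_def2(episode_rewards, t=2))   # average over the remaining steps from t = 2
```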

