
Is there any empirical or theoretical evidence on the effect of the initial values of the state-action and state value functions (the values an RL agent assigns to visited states) on the training of an RL agent via MC policy evaluation and GLIE policy improvement?

For example, consider two initialization scenarios for the Windy Gridworld problem:

Implementation: I have modified the problem (keeping the step penalty) to include a non-desired terminal state and a desired terminal state, which are conveyed to the agent as a negative-reward and a positive-reward state, respectively. The implementation ensures that MC sampling ends at a terminal state and that the penalty/reward is credited to the state-action value rather than the state value, since this is a control problem. Also, I have 5 moves: north, south, east, west and stay.
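
For reference, here is a minimal sketch of the kind of environment I have in mind (the grid size, wind strengths, terminal-state positions and reward values below are just placeholders, not my exact implementation):

```python
class ModifiedWindyGridworld:
    """Windy Gridworld with a step penalty, a desired (positive reward) terminal
    state and a non-desired (negative reward) terminal state.
    Grid size, wind strengths and reward values are placeholders."""

    ACTIONS = ["north", "south", "east", "west", "stay"]  # 5 moves

    def __init__(self, step_penalty=-1.0, goal_reward=20.0, trap_reward=-20.0):
        self.height, self.width = 7, 10
        self.wind = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push per column
        self.start = (3, 0)
        self.goal = (3, 7)   # desired terminal state
        self.trap = (5, 7)   # non-desired terminal state (placeholder position)
        self.step_penalty = step_penalty
        self.goal_reward = goal_reward
        self.trap_reward = trap_reward

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        r, c = self.pos
        dr = {"north": -1, "south": 1, "east": 0, "west": 0, "stay": 0}[action]
        dc = {"north": 0, "south": 0, "east": 1, "west": -1, "stay": 0}[action]
        # wind pushes the agent "north" (towards row 0), then clamp to the grid
        r = min(max(r + dr - self.wind[c], 0), self.height - 1)
        c = min(max(c + dc, 0), self.width - 1)
        self.pos = (r, c)
        if self.pos == self.goal:
            return self.pos, self.goal_reward, True   # episode ends; reward credited to (s, a)
        if self.pos == self.trap:
            return self.pos, self.trap_reward, True
        return self.pos, self.step_penalty, False
```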

NOTE: I am not sure whether this changes the objective of the problem. In the original problem, the objective was to minimize the number of steps required to reach the goal state.

  • We set the reward of reaching the desired terminal state to a value that is higher than the randomly initialized values of the value function; for example, we can set the reward to $20$ and initialize the values with random numbers in the range $[1, 7]$

  • We set the reward of reaching the desired terminal state to a value that is comparable to the randomly initialized values of the value functions; for example, we can set the reward to $5$ and initialize the values with random numbers in the range $[1, 10]$

As far as I can see, in the first case the algorithm will converge quickly, because the reward for the desired terminal state is much higher than any initial value, which will push the agent towards reaching that state.

In the second case, this might not hold: if the terminal reward state is surrounded by states with comparably high initial values, the agent will try to go to those states instead.

The step penalty ensures that the agent eventually reaches a terminal state, but will this skew the agent's path and severely affect its convergence time? This might be problematic in large state spaces, since we will not be able to explore the entire state space, and the exploration constant $\epsilon$ might derail training by sending the agent towards a state with a large, spurious initial value. Is my understanding correct?
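
For concreteness, the two scenarios could be set up roughly like this (only a sketch; the helper name and the state/action sets are placeholders matching the gridworld above):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q_table(states, actions, low, high):
    """Randomly initialize Q(s, a) uniformly in [low, high]."""
    return {s: {a: rng.uniform(low, high) for a in actions} for s in states}

states = [(r, c) for r in range(7) for c in range(10)]
actions = ["north", "south", "east", "west", "stay"]

# Scenario 1: goal reward (20) clearly dominates the random initial values in [1, 7]
q_scenario_1 = init_q_table(states, actions, low=1.0, high=7.0)
goal_reward_1 = 20.0

# Scenario 2: goal reward (5) is comparable to the random initial values in [1, 10]
q_scenario_2 = init_q_table(states, actions, low=1.0, high=10.0)
goal_reward_2 = 5.0
```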

nbro

1 Answer


There seem to be two different ideas in this question:

  1. What's the impact / importance of our choice for reward values?
  2. What's the impact / importance of our choice for initial value estimates (how do we initialise our table of $Q(s, a)$ values in the case of a simple, tabular RL algorithm like Sarsa or $Q$-learning)?

The reward values are typically assumed to be a part of the problem definition - something we shouldn't modify if we're using an existing problem definition as a benchmark. But if we're in charge of defining the problem ourselves, we can of course also pick the reward values. Modifying them may indeed have a huge impact on the speed with which RL algorithms are able to learn a task - but it may also intrinsically change the task: it changes the objective of the problem, and it may change which policies are optimal.


As for initialisation of our table of value approximations: by default, we normally assume an all-$0$ initialisation. However, it is a fairly common trick (in tabular RL algorithms, without function approximation) to initialise value estimates optimistically; pick initial $Q(s, a)$ estimates that are higher than the true values are likely to be (or even higher than a known upper bound on what the true values could possibly be). This is often beneficial - also in large gridworlds with sparse rewards (e.g. a single distant goal somewhere) and negative rewards (i.e. costs) incurred for every step taken - because it incentivises exploration of state-action pairs that have not yet been tried.

Suppose you have your gridworld with negative rewards associated with every time step, and the optimal policy being one that takes you to a distant goal as soon as possible. If all $Q(s, a)$ are initialised to values below $0$, your agent may quickly learn that everything it does is equally bad anyway, and get stuck near the starting position. If all $Q(s, a)$ values are initialised optimistically (to at least $0$ if that exceeds any "ground truth" value of any state in the environment), your agent during the learning process will still have optimistic expectations of what it can achieve if it just tries to navigate to unexplored parts of the state-action space.
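
To make this concrete, here is a minimal sketch of what optimistic initialisation could look like for a simple tabular $Q$-learning agent on such a gridworld (the function names, hyperparameters and initial values are purely illustrative, not a prescribed implementation):

```python
from collections import defaultdict
import random

def make_q_table(actions, initial_value):
    """Tabular Q-table; every unseen (state, action) pair starts at initial_value.
    With a step cost of -1 per move, all true returns are negative, so
    initial_value = 0.0 (or higher) is optimistic; a very negative value is pessimistic."""
    return defaultdict(lambda: {a: initial_value for a in actions})

def epsilon_greedy(q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[state][a])

def q_learning_update(q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    target = r if done else r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])

# With optimistic initialisation, untried actions keep looking attractive until
# their estimates are pulled down by observed (negative) returns, so even the
# greedy part of the policy drives the agent into unexplored regions.
actions = ["north", "south", "east", "west", "stay"]
q_optimistic = make_q_table(actions, initial_value=0.0)      # >= any true value here
q_pessimistic = make_q_table(actions, initial_value=-100.0)  # agent sees little reason to explore
```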

Dennis Soemers