
In Reinforcement Learning: An Introduction (2nd edition) by Sutton and Barto, there is an example of the pole-balancing problem (Example 3.4).

In this example, they write that this problem can be treated as an episodic task or continuing task.

I think that it can only be treated as an episodic task, because it has a natural end: the pole falling over.

I have no idea how this can be treated as a continuing task. Even in the OpenAI Gym CartPole environment, there is only an episodic mode.

user3595632

4 Answers


It can be treated as a continuing task in the sense that, after the failure, the agent simply keeps receiving a reward of $0$ at every time step, ad infinitum.

From the book:

we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be -1 on each failure and zero at all other times. The return at each time would then be related to $-\gamma^K$, where $K$ is the number of time steps before failure.

(Here I have used $\gamma$ as the discount factor).

Said another way, assuming the agent fails at the $(K+1)$-th step, the reward is $0$ up to that step, $-1$ at it, and then $0$ for eternity.

So the return: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^K R_{t+K+1} + ... = -\gamma^K$$
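
As a quick sanity check (a small Python sketch of my own, not from the book), summing that discounted reward sequence numerically gives exactly $-\gamma^K$:

```python
# Sanity check: the only non-zero reward is -1 at step t + K + 1,
# followed by 0 forever, so the discounted return is -gamma^K.
gamma, K = 0.9, 5

rewards = [0.0] * K + [-1.0] + [0.0] * 100   # R_{t+1}, R_{t+2}, ...
G_t = sum(gamma**i * r for i, r in enumerate(rewards))

print(G_t)          # -0.59049
print(-gamma**K)    # -0.59049  ->  G_t == -gamma^K
```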

Bijay Gurung

The key is that a reinforcement learning algorithm such as SARSA (in its tabular form) works by splitting the state space up into discrete points, and then trying to learn the best action at every point.

To do this, it tries to pick actions that maximize the reward signal, possibly subject to some kind of exploration policy like epsilon-greedy.
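
To make that concrete, here is a rough sketch (my own, not taken from the book) of a tabular SARSA update with epsilon-greedy exploration over a discretized state space; the bin count, action count, and hyperparameters are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch only: tabular SARSA with epsilon-greedy exploration
# over a discretized cart-pole state space. The number of bins, the action
# count and the hyperparameters are assumptions, not values from the book.
n_states, n_actions = 500, 2          # e.g. 500 discrete bins; push left/right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_update(s, a, r, s_next, a_next):
    # SARSA: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```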

In cart-pole, two common reward signals (both sketched in code below) are:

  1. Receive a reward of 1 when the pole is within a small angle of the upright position, and 0 otherwise.
  2. Receive a reward that increases linearly with the height of the pole's tip above the ground.
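
For concreteness, both options might look roughly like this (my own sketch; the pole-angle variable and pole length are assumptions, not tied to any particular Gym version):

```python
import math

POLE_LENGTH = 1.0   # assumed pole length; purely illustrative

def reward_near_upright(pole_angle, threshold=0.05):
    # Option 1: +1 only while the pole is within a small angle of vertical.
    return 1.0 if abs(pole_angle) < threshold else 0.0

def reward_pole_height(pole_angle):
    # Option 2: reward grows with the height of the pole tip above the pivot,
    # which for a rigid pole is proportional to cos(angle).
    return POLE_LENGTH * math.cos(pole_angle)
```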

In both cases, an agent can continue to learn after the pole has fallen: it will just want to move the pole back up, and will try to take actions to do so.

However, an offline algorithm wouldn't update its policy while the agent is running, so this kind of algorithm wouldn't benefit from a continuing task. An online algorithm, in contrast, updates its policy as it goes, and has no reason to stop between episodes, except that it might become stuck in a bad state.

John Doucette

From Sutton & Barto's book (p. 56):

Example 3.4: Pole-Balancing The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over: A failure is said to occur if the pole falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical after each failure. This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be $+1$ for every time step on which failure did not occur, so that the return at each time would be the number of steps until failure. In this case, successful balancing forever would mean a return of infinity. Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be $-1$ on each failure and zero at all other times. The return at each time would then be related to $-\gamma^{K-1}$, where $K$ is the number of time steps before failure (as well as to the times of later failures). In either case, the return is maximized by keeping the pole balanced for as long as possible.

mimoralea

In this case in Sutton & Barto, the authors are talking about removing the episode termination. They then treat the fall of the pole, followed by a reset to a given (originally the starting) state distribution, as an ordinary transition with a negative reward within a longer continuing problem. This is a change to the environment description, and it comes with some requirements, such as needing to use discounting*.

This is different to the "absorbing state" treatment used elsewhere to put episodic and continuing tasks on the same mathematical footing.

You might use this view of a problem in any environment where the goal is to maintain a steady state. It is in part motivated by the fact that even in the episodic framing of the problem, a perfect agent would never end an episode. However, converting an episodic problem to a continuing one, with resets (from a limited set of transitions) as part of the state-transition and reward scheme, may be reasonable for other purposes too.


* Or the average reward setting, although they have not covered that option at that point in the book.
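
As a rough illustration of this reformulation (my own sketch, assuming the classic 4-tuple `gym` step API, not anything from the book), a wrapper could intercept the terminal transition, emit the $-1$ reward, reset the pole in place, and never report the episode as done:

```python
import gym

class ContinuingCartPole(gym.Wrapper):
    """Sketch: recast CartPole as a continuing task.

    On failure, emit a reward of -1 and silently reset the pole, but never
    report the episode as done. Written against the classic 4-tuple
    gym step() API; gymnasium's 5-tuple API would need small changes.
    """

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        if done:
            # Note: gym.make adds a TimeLimit wrapper, so `done` can also mean
            # the step cap was hit rather than a genuine failure; a more
            # careful version would distinguish the two cases.
            obs = self.env.reset()
            return obs, -1.0, False, info
        # Every other time step: reward 0, as in the book's continuing setup.
        return obs, 0.0, False, info

env = ContinuingCartPole(gym.make("CartPole-v1"))
```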

Neil Slater