
In OpenAI Gym "reward" is defined as:

reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
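Concretely, this reward is the second element of the tuple returned by `env.step`. A minimal sketch (assuming the classic Gym API and that the MuJoCo-based robotics environments are installed):

```python
import gym

env = gym.make("FetchReach-v1")
obs = env.reset()
action = env.action_space.sample()           # a random action, purely for illustration
obs, reward, done, info = env.step(action)   # reward "achieved by the previous action"
print(reward)
env.close()
```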

I am training Hindsight Experience Replay on Fetch robotics environments, where rewards are sparse and binary indicating whether or not the task is completed. The original paper implementing HER uses success rate as a metric in its plots, like so:

[Figure: success-rate plots from the HER paper]

On page 5 of the original paper, it is stated that the reward is binary and sparse.

When I print the rewards obtained during a simulation of FetchReach-v1 trained with HER, I get the following values. The first column shows the reward and the second column shows the episode length.

[Figure: per-time-step rewards (first column) and episode lengths (second column) printed during a FetchReach-v1 rollout]

As can be seen, I receive a reward at every time step; sometimes it is $-1$ at every step of the episode, for a total of $-50$. The maximum total reward I can achieve over an episode is $0$.
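Here is roughly how I obtain that printout. This is only a minimal sketch: it uses a random policy as a stand-in for my trained HER policy, hard-codes the 50-step episode length of FetchReach-v1, and reads the `is_success` flag that the Fetch environments put in `info`:

```python
import gym

env = gym.make("FetchReach-v1")
obs = env.reset()
total_reward = 0.0

for t in range(50):                        # FetchReach-v1 episodes are 50 steps long
    action = env.action_space.sample()     # stand-in for the trained HER policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
    print(reward, t + 1)                   # reward at this step, steps elapsed so far

# Total return lies between -50 (never at the goal) and 0 (always at the goal);
# success is reported separately in info["is_success"].
print("return:", total_reward, "success:", info["is_success"])
env.close()
```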

Therefore, my question is: what is the reward obtained at each time step? What does it represent, and how is it different from the success rate?


1 Answer


Page 6 of the paper describes the exact reward functions, and why they were used:

Goals: Goals describe the desired position of the object (a box or a puck depending on the task) with some fixed tolerance of $\epsilon$, i.e. $\mathcal{G} = \mathbb{R}^3$ and $f_g(s) = [|g - s_{object}| \le \epsilon]$, where $s_{object}$ is the position of the object in the state $s$. The mapping from states to goals used in HER is simply $m(s) = s_{object}$.

Rewards: Unless stated otherwise we use binary and sparse rewards $r(s, a, g) = -[f_g(s') = 0]$, where $s'$ is the state after the execution of the action $a$ in the state $s$. We compare sparse and shaped reward functions in Sec. 4.4.

So, at least in the base version (which I believe is your FetchReach-v1), the agent receives a reward of $-1$ for every time step spent more than $\epsilon$ from the goal state, and a reward of $0$ for every time step spent within $\epsilon$ of the goal state. Thus, a score of $-5.0$ would seem to correspond to the agent moving directly to the goal and staying there, while a score of $-50.0$ would correspond to the agent failing to reach the goal state entirely.
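In code, this sparse reward is essentially a distance-threshold check. A rough sketch of the idea (the 0.05 tolerance and the `np.where` formulation are illustrative, not necessarily the exact Gym implementation):

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, distance_threshold=0.05):
    """Return 0 if the achieved goal is within epsilon of the desired goal, else -1."""
    d = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
    return np.where(d > distance_threshold, -1.0, 0.0)

print(sparse_reward(np.array([0.02, 0.0, 0.0]), np.zeros(3)))  # within tolerance -> 0.0
print(sparse_reward(np.array([0.20, 0.0, 0.0]), np.zeros(3)))  # outside tolerance -> -1.0
```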
