
I've seen numerous mathematical explanations of rewards, value functions $V(s)$, and return functions. The reward provides an immediate payoff for being in a specific state: the better the reward, the better the state.

As I understand it, it can sometimes be better to be in a low-reward state because we can accumulate more reward in the long term, which is where the expected return function comes in. An expected return, return, or cumulative reward function effectively adds up the rewards from the current state to the goal state. This implies it's model-based. However, it seems a value function does exactly the same.

Is a value function a return function? Or are they different?


1 Answer


There is a strong relationship between a value function and a return: a value function calculates the expected return from being in a certain state, or from taking a specific action in a specific state. A value function is not a "return function"; it is an "expected return function", and that is an important difference.
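In the usual notation (e.g. Sutton & Barto), for a policy $\pi$:

$$v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$

where $G_t$ denotes the return from time step $t$, made precise below.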

A return is a measured value (or a random variable, when discussed in the abstract) representing the actual (discounted) sum of rewards seen following a specific state or state/action pair.
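Concretely, with a discount factor $\gamma \in [0, 1]$, the return following time step $t$ is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

which is a random variable, because the rewards depend on the (possibly stochastic) policy and environment dynamics.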

Typically there is no need to express an individual return as a "return function", although you may find many formulae in RL for sampling or estimating specific return values in order to calculate targets or errors for the value function.
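For instance, here is a minimal sketch in Python of how a sampled return can be computed from one episode and used as a Monte Carlo target for a tabular state-value estimate. The episode format, learning rate and discount factor are invented for illustration, not something from the question:

```python
from collections import defaultdict

def monte_carlo_update(V, episode, gamma=0.99, alpha=0.1):
    """Every-visit Monte Carlo: move V(s) toward the sampled return from s.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state` (a simplifying assumption).
    """
    g = 0.0
    # Work backwards so each state's sampled return G_t is built in one pass.
    for state, reward in reversed(episode):
        g = reward + gamma * g              # sampled return G_t from this state
        V[state] += alpha * (g - V[state])  # the sampled return is the target
    return V

# Hypothetical usage with an invented two-state episode:
V = defaultdict(float)
episode = [("s0", 0.0), ("s1", 1.0)]
monte_carlo_update(V, episode)
print(dict(V))
```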

> A return (or cumulative reward) function effectively adds up the rewards from the current state to the goal state. This implies it's model-based.

If you have a simple MDP, already accurately modelled, where you can calculate expected return directly from that model, then, yes, in theory, that would be a value function. However, this could be more computationally intensive to resolve than dynamic programming (e.g. Policy Iteration or Value Iteration), and in many cases you don't have any such model, but can still apply RL approaches to learn a value function from experience.
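To make that concrete, here is a minimal value iteration sketch on a tiny, fully specified MDP. The transition table is invented purely for illustration; the point is that, with such a model, the expected return can be computed directly, whereas model-free methods have to estimate the same quantity from sampled experience:

```python
# Value iteration on a tiny, hand-specified MDP (illustrative only).
# model[state][action] is a list of (probability, next_state, reward, done).
model = {
    "s0": {"stay": [(1.0, "s0", 0.0, False)],
           "go":   [(0.8, "s1", 1.0, True), (0.2, "s0", 0.0, False)]},
    "s1": {},  # terminal state: no actions available
}

gamma = 0.9
V = {s: 0.0 for s in model}

for _ in range(1000):  # sweep until (approximately) converged
    delta = 0.0
    for s, actions in model.items():
        if not actions:  # terminal states keep value 0
            continue
        best = max(
            sum(p * (r + (0.0 if done else gamma * V[s2]))
                for p, s2, r, done in outcomes)
            for outcomes in actions.values()
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-8:
        break

print(V)  # optimal state values under this made-up model
```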
