
The book by Sutton and Barto, Reinforcement Learning: An Introduction, defines a model in Reinforcement Learning as

something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave.

In this answer, the answerer makes a distinction:

There are broadly two types of model:

  • A distribution model which provides probabilities of all events. The most general function for this might be $p(r,s'|s,a)$ which is the probability of receiving reward $r$ and transitioning to state $s'$ given starting in state $s$ and taking action $a$.

  • A sampling model which generates a reward $r$ and next state $s'$ when given a current state $s$ and action $a$. The samples might come from a simulation, or just be taken from the history of what the learning algorithm has experienced so far.

The main difference is that with a sampling model I only have a black box which, given a certain input $(s,a)$, generates an output, but I don't know anything about the probability distributions of the MDP. However, given a sampling model, I can reconstruct (approximately) the probability distributions by running thousands of experiments (this is what, e.g., Monte Carlo Tree Search relies on).
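
As a concrete illustration of this reconstruction, here is a minimal sketch in Python (the name `sample_model` is hypothetical and stands for the black box) of approximating $p(s' \mid s, a)$ by repeatedly querying a sampling model:

```python
from collections import Counter

def estimate_transition_probs(sample_model, s, a, n_samples=10_000):
    """Approximate p(s' | s, a) by querying a black-box sampling model.

    `sample_model(s, a)` is assumed to return a pair (r, s') drawn from
    the (unknown) distribution p(r, s' | s, a).
    """
    counts = Counter()
    for _ in range(n_samples):
        r, s_next = sample_model(s, a)  # one Monte Carlo sample
        counts[s_next] += 1
    # Relative frequencies converge to p(s' | s, a) as n_samples grows.
    return {s_next: c / n_samples for s_next, c in counts.items()}
```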

On the other hand, if I have a distribution model, I can always sample from it.
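
Conversely, a minimal sketch of sampling from a distribution model, assuming (purely for illustration) a tabular representation where `dist_model[(s, a)]` maps each $(s,a)$ pair to a list of $((r, s'), p)$ entries:

```python
import random

def sample_from_distribution_model(dist_model, s, a):
    """Draw one (r, s') pair from a tabular distribution model p(r, s' | s, a)."""
    outcomes, probs = zip(*dist_model[(s, a)])  # [((r, s'), p), ...]
    return random.choices(outcomes, weights=probs, k=1)[0]
```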

I was wondering:

  1. whether what I wrote is correct;

  2. whether this distinction has been remarked upon in the literature, and where I can find a more in-depth discussion of the topic;

  3. whether anyone has ever separated model-based algorithms that use a distribution model from model-based algorithms that use only a sampling model.


1 Answer


I think that your description is roughly correct, but I wouldn't call a "sampling model" a "model", because it doesn't necessarily model anything. There are exceptions: for example, you may first learn in a simulation in order to later act in the real world or environment (in this sense, the simulation is a model of the real environment, but this does not have to be the case, i.e. you may just want to act in the simulation itself, e.g. Atari games), or the sampling function may really be a model of the MDP, but, in that case, you can just call it a model estimate.

So, you can call it a

  • sampling function, in case you sample, e.g., from an experience replay,
  • environment function, in case $r$ and $s'$ are returned by the environment,
  • model estimate, in case it's an estimate of $p(s' \mid s, a)$ (people may consider an experience replay a model estimate or, at least, information that can be used to build a model estimate; a counting sketch of this follows the list)
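
For instance, a model estimate of $p(s' \mid s, a)$ can be built from an experience replay simply by counting transitions. A minimal sketch, assuming the replay is a list of $(s, a, r, s')$ tuples:

```python
from collections import defaultdict

def model_estimate_from_replay(replay):
    """Estimate p(s' | s, a) from a replay of (s, a, r, s') transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, r, s_next in replay:
        counts[(s, a)][s_next] += 1
    # Normalize the counts for each (s, a) pair into relative frequencies.
    return {
        sa: {s_next: c / sum(next_counts.values())
             for s_next, c in next_counts.items()}
        for sa, next_counts in counts.items()
    }
```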

The important thing to keep in mind is that, to do reinforcement learning, when you want to take an action $a$ in a certain state $s$, you need a function that returns a reward $r$ and a next state $s'$.
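
Whatever you call it, that function has the same shape. A minimal, self-contained sketch for a toy chain environment (entirely made up for illustration; a simulator, a model estimate, or replay sampling would plug in behind the same signature):

```python
def environment_function(s, a):
    """Toy deterministic chain: given state s (0..10) and action a (+1 or -1),
    return a reward r and the next state s'."""
    s_next = max(0, min(10, s + a))   # clamp to the states 0..10
    r = 1.0 if s_next == 10 else 0.0  # reward only at the goal state
    return r, s_next
```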

I don't know if this distinction has been emphasized in the literature, but, as you noted, you can learn/estimate a (transition) model by exploring the world. I had asked a related question here a few years ago. You can also estimate the reward function, which is sometimes incorporated in the "model" of the environment; in that case, the model is denoted as $p(s', r \mid s, a)$ rather than just $p(s' \mid s, a)$, and the latter can be recovered from the former by summing over rewards.
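
Concretely, the state-transition distribution and the expected reward are both marginals of the joint dynamics:

$$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a), \qquad r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a).$$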

People may also confuse this environment function with an exploratory policy, as they are, in a way, both used for exploration, but I think the concepts are distinct enough: an exploratory policy is a way of deciding how to act given your current knowledge or ignorance, and it can be viewed as a way of exercising/calling the environment function.
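
To make the separation explicit, here is a minimal sketch (names are hypothetical) in which an $\epsilon$-greedy exploratory policy decides the action, while the environment function only supplies $r$ and $s'$:

```python
import random

def epsilon_greedy_step(Q, s, actions, environment_function, epsilon=0.1):
    """The policy chooses a; the environment function only returns (r, s')."""
    if random.random() < epsilon:
        a = random.choice(actions)                               # explore
    else:
        a = max(actions, key=lambda act: Q.get((s, act), 0.0))   # exploit
    r, s_next = environment_function(s, a)  # exercise the environment function
    return a, r, s_next
```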
