
In AI literature, deterministic vs. stochastic and fully observable vs. partially observable are usually considered two distinct properties of the environment.

I'm confused about this because what appears random can be described by hidden variables. To illustrate, take an autonomous car (Russell & Norvig describe taxi driving as stochastic). I can say the environment is stochastic because I don't know what the other drivers will do. Alternatively, I can say that the actions of the drivers are determined by their mental state, which I cannot observe.

As far as I can see, randomness can always be modeled with hidden variables. The only argument I came up with as to why the distinction is necessary is Bell's inequality, but I don't think that AI researchers had this in mind.

Is there some fundamental difference between stochasticity and partial observability or is this distinction made for practical reasons?

nbro
martinkunev

2 Answers


I think the distinction is made more for conceptual reasons, although it has practical implications, so let me review the usual definitions of stochastic and partially observable environments.

A stochastic environment can be modeled as a Markov Decision Process (MDP) or Partially Observable MDP (POMDP). So, an environment can be

  • stochastic and partially observable
  • stochastic and fully observable

The stochasticity refers to the dynamics of the environment and, more specifically, to how the environment stochastically moves from one state to another after an action is taken (basically, a Markov chain with actions and rewards). In other words, in a stochastic environment, we have the distribution $p(s' \mid s, a)$ (or, in some cases, the reward is also included, $p(s', r \mid s, a)$). If $p(s' \mid s, a)$ assigned a probability of $1$ to one of the states and $0$ to all other states, we would have a deterministic environment.
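To make this concrete, here is a minimal sketch (a hypothetical toy environment of my own, not something from the book) of what stochastic and deterministic transition models look like when written down as tables of $p(s' \mid s, a)$:

```python
import numpy as np

# Hypothetical toy environment with 3 states and 2 actions (my own example).
# p[s, a] is the distribution p(s' | s, a) over the 3 possible next states.

# Stochastic dynamics: from a given (s, a), several next states are possible.
stochastic_p = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],   # from state 0
    [[0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],   # from state 1
    [[0.2, 0.2, 0.6], [0.0, 0.0, 1.0]],   # from state 2
])

# Deterministic dynamics: each (s, a) puts probability 1 on a single next state.
deterministic_p = np.array([
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],   # from state 0
    [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],   # from state 1
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],   # from state 2
])

rng = np.random.default_rng()

def step(p, state, action):
    """Sample the next state s' from p(s' | s, a)."""
    return rng.choice(len(p[state, action]), p=p[state, action])

print(step(stochastic_p, state=0, action=0))     # could print 0, 1 or 2
print(step(deterministic_p, state=0, action=0))  # always prints 1
```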

The partial observability refers to the fact that we don't know which state the agent is in, so we can think of having or maintaining a probability distribution over states, like $b(s)$. So, in the case of a POMDP, not only are we uncertain about what the next state $s'$ might be after we have taken $a$ in our current state $s$, but we are not even sure what $s$ currently is.

So, the difference is made so that we can deal with uncertainty about different parts of the environment (the dynamics and the actual knowledge of the state). Think about a blind guy who doesn't have the full picture (I hope this doesn't offend anyone) and a guy who sees well. The guy who sees well still isn't sure about tomorrow (maybe this is not a great example, as you could argue that this is also because the guy who sees well doesn't know the full state, but I hope it gives you the intuition).

Of course, this has practical implications. For example, it seems that you cannot directly apply the solutions that you use for MDPs to POMDPs. More precisely, for an MDP, you learn a policy $\pi(a \mid s)$, i.e. a probability distribution over actions given states; if you don't know which state you are in, such a policy is quite useless.

To deal with the uncertainty about the state the agent is in, in POMDPs, we also have the concept of an observation, which is the information that the agent gathers from the environment about the current state (e.g., in the example of a blind guy, the observations would be the sounds, touch, etc.), in order to update its belief about the current state. In practice, some people tried to apply the usual RL algorithms for MDPs to POMDPs (see e.g. DQN or this), but they made a few approximations, which turned out to be useful and successful.

If the difference still isn't clear, just take a look at the equation that relates the belief state to the transition model (dynamics) of the environment

$$ \underbrace{b'(s')}_{\text{Next belief state}} = \alpha \underbrace{P(o \mid s')}_{\text{Probability of observation } o \text{ given } s'} \sum_{s} \underbrace{P(s' \mid s, a)}_{\text{Transition model}} \underbrace{b(s)}_{\text{Previous belief state}} $$
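Here is a minimal sketch of that update in code, under my own assumptions about how the arrays are laid out (this is not code from any particular library):

```python
import numpy as np

def update_belief(b, a, o, P, O):
    """
    b: previous belief b(s), shape (S,)
    a: action taken
    o: observation received
    P: transition model, P[s, a, s'] = P(s' | s, a), shape (S, A, S)
    O: observation model, O[s', o] = P(o | s'), shape (S, num_observations)
    Returns the next belief b'(s'), shape (S,).
    """
    predicted = P[:, a, :].T @ b               # sum_s P(s' | s, a) * b(s)
    unnormalized = O[:, o] * predicted         # multiply by P(o | s')
    return unnormalized / unnormalized.sum()   # normalization constant alpha
```

A POMDP policy would then take this belief vector, rather than a single state index, as its input.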

So, in a POMDP, the policy, as stated above, cannot in theory depend on $s$; it needs to depend on $b(s)$, the belief state, i.e. a probability distribution over states.

If this answer still isn't satisfactory, you should read (though you probably already have) section 2.3.2, Properties of task environments, of the AIMA book (3rd edition). Their description of stochastic and partially observable environments seems to be consistent with what I wrote here, but maybe their description of a stochastic environment is not fully clear, because they say

If the next state of the environment is completely determined by the current state and the action executed by the agent, then we say the environment is deterministic; otherwise, it is stochastic

The unclear part is completely determined. They should have said deterministically determined (which you can use for a rap song).

However, they later clarify their definition by saying

our use of the word "stochastic" generally implies that uncertainty about outcomes is quantified in terms of probabilities

In addition to that, they call an environment that is either stochastic or partially observable uncertain. It makes sense to do this because uncertainty makes the problems harder, so we can differentiate between certain and uncertain environments.

To be honest, I don't know if there's some kind of mathematical formalism that doesn't differentiate between stochastic or partially observable environments, but I am not sure how useful it might be.

nbro

A few points I'd like to add (without repeating the info already provided by nbro's answer):


  1. I think you're half-right, in that indeed we can probably always model randomness as hidden information (e.g., as the hidden random seed in a software implementation of an environment). However, the other way around does not work; we cannot always model any partially observable environment as a stochastic one!

So we probably have a subset relation here, not an equivalence relation. Personally, I often find stochastic environments to be somehow "easier" to handle than partially observable ones. So, it is generally beneficial to simply treat them as such, rather than unnecessarily casting them into the often-more-difficult format of partially observable environments.


  2. In stochastic (but fully observable) environments, there always exists an optimal deterministic policy, but in partially observable environments an optimal policy may need to be non-deterministic. As a corollary, I would say that this indeed implies that there really is some fundamental difference between the two.

If an environment is just stochastic, but fully observable, non-deterministic policies may still also be optimal, but there is also always guaranteed to be at least one fully deterministic policy; a policy $\pi$ that, for any state $s$ assigns a probability of $\pi(a \mid s) = 1$ to just one action $a$ (and $0$ to any other action for that same state). Sometimes it may be fine to be indifferent (and distribute the probability mass over more than a single action), but this is never strictly required for optimality.

In a partially-observable environment, it is possible that an optimal policy must be non-deterministic. Consider this "drawing" of an environment, consisting of a few squares, with the current position of the agent labelled as $A$, and the position of the goal labelled as $G$. The possible actions are to move either left or right.

$$\square A \square \square G \square \square \square$$

Suppose that this environment is partially observable to the extreme extent that the agent never has any idea where it is, i.e. all states are aliased (they look the same to the agent). A deterministic policy would then always go left or always go right but, depending on whether the agent is currently to the left or to the right of the goal, one of those two deterministic policies would never reach the goal, and the agent has no way at all to tell which one is the bad policy. The optimal policy in this environment is to simply go left with probability $0.5$ and right with probability $0.5$; eventually, the agent will get lucky and end up in the goal position. The partially observable nature of this environment really makes it necessary to follow a stochastic policy (a small simulation of this is sketched at the end of this answer).


  3. In partially observable environments, we often consider the possibility that there can be actions that allow us to obtain new information. These are not necessarily actions that directly lead to any rewards or future returns, but really only allow us to observe things that we could not observe before (and hence possibly allow us to, with more certainty, follow a better policy afterwards). This idea does not really exist in fully observable stochastic environments.

  4. In multi-agent environments (consider, for example, many card games), other agents (often opponents, but they could also be agents with goals similar to our own) may have access to different information/observations than our own agent. This means that, for example through counterfactual reasoning or even explicit communication, we may gain full information or at least update our beliefs about the parts of the state that are unobservable to us. For example, based on the actions of an opponent in a card game, we may infer that it is very likely or very unlikely for them to have certain cards, because otherwise they very likely would not have acted the way they did. This sort of reasoning does not apply to fully observable stochastic environments.
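To make the corridor example from point 2 concrete, here is a small simulation sketch (my own toy code; the two possible starting positions and the step limit are arbitrary assumptions): every state looks identical to the agent, each deterministic policy fails from one of the two starting sides, while the uniformly random policy reaches the goal from either side.

```python
import random

# Corridor from the drawing above: 8 squares, goal G at index 4.
# Assume the agent may start at index 1 (left of G) or index 6 (right of G).
GOAL = 4
STARTS = [1, 6]
LEFT, RIGHT = -1, +1

def run(policy, start, max_steps=1000):
    """Return True if the policy reaches the goal within max_steps.
    The agent never observes its position (all states are aliased)."""
    pos = start
    for _ in range(max_steps):
        if pos == GOAL:
            return True
        pos = max(0, min(7, pos + policy()))   # walls at both ends
    return pos == GOAL

always_left = lambda: LEFT
always_right = lambda: RIGHT
uniform_random = lambda: random.choice([LEFT, RIGHT])

for start in STARTS:
    print(start,
          run(always_left, start),     # True only when starting right of G
          run(always_right, start),    # True only when starting left of G
          run(uniform_random, start))  # True from either side (with high prob.)
```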

Dennis Soemers