I am quite new to the Reinforcement Learning domain and I am curious about something. It seems to be the case that the majority of current research assumes Markovian environments, that is, future states of the process depend only upon the present state, not on the sequence of events that preceded it. I was curious about how we can assign rewards when the Markovian property doesn't hold anymore. Do state-of-the-art RL theory and research support this?
1 Answer
Dealing with a non-Markovian process is unusual in Reinforcement Learning. Although some explicit attempts have been made to handle such processes, the most common approach when confronted with a non-Markovian environment is to try to make the agent's representation of it Markovian.
After reducing the agent's model of the dynamics to a Markovian process, rewards are assigned by the environment in exactly the same way as before: the environment simply sends the agent a reward signal in response to each action.
The Markovian assumption is essentially a formalism of the idea that the future can be predicted from the present. It says that if you know the dynamics of a system, and you know the state of the system now, you know everything you need to predict the state of the system later, and how we got to this state is not important. Formally, we write this as $P(s_t \mid s_{t-1:0}) = P(s_t \mid s_{t-1})$.
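To make that equation concrete, here is a toy check (the two-state chain and its transition matrix are made up purely for illustration): if the process really is Markovian, conditioning on extra history should not change the estimated transition probabilities.

```python
import numpy as np

# Toy two-state Markov chain (transition matrix chosen arbitrarily).
# If the Markov property holds, P(s_t | s_{t-1}, s_{t-2}) should match
# P(s_t | s_{t-1}), no matter what extra history we condition on.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],   # row = current state, column = next state
              [0.3, 0.7]])

states = [0]
for _ in range(100_000):
    states.append(rng.choice(2, p=P[states[-1]]))
s = np.array(states)

# Empirical P(s_t = 0 | s_{t-1} = 0)
short_mask = s[1:-1] == 0
p_short = np.mean(s[2:][short_mask] == 0)

# Empirical P(s_t = 0 | s_{t-1} = 0, s_{t-2} = 1): the extra history is irrelevant
long_mask = (s[1:-1] == 0) & (s[:-2] == 1)
p_long = np.mean(s[2:][long_mask] == 0)

print(p_short, p_long)   # both come out close to 0.9
```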
That said, the models we use in AI are usually simplifications of the real world. When we simplify the world, we can introduce non-Markovian dynamics. However, if the model grows too complex, and the state space too large, learning will take too long. The goal is then to define a state space that is small enough to be learnable, and not too bad an approximation of the real dynamics of the world. AI researchers have several tools to do this.
As a working example, imagine that the future position of a robot depends mainly on its current position and velocity, along with the action the robot takes right now. Using these variables to define a state, we get almost Markovian dynamics, but as the robot moves over time, its battery drains and its movements become slightly more imprecise. If we want to remove this error, we can:
1. Expand the state variables. If we add "current battery level" to the state, our process becomes Markovian again. This is the most obvious approach, but it only works up to a point: the state space grows exponentially as you add new variables, and quickly becomes too large to learn. In many problems a complex state is therefore decomposed into simpler sub-states, which depend on one another only weakly or not at all. The main limiting factor in learning over an exponential state space is that the number of parameters in our model grows exponentially. If the original space had $O(2^n)$ parameters, splitting it in half yields two separate learning problems of size $O(2^{n/2})$ each, or $O(2^{n/2+1})$ in total. Splitting the problem in two thus reduces its size by a factor of roughly $2^{n/2}$, which is enormous. This is the technique exploited by Dynamic Bayesian Networks. In the case of our robot, we could make the current $x$ position depend only on the previous $x$ position, the previous $x$ velocity, and the robot's action, rather than on all previous variables, and likewise for the other variables.
2. Use higher-order models of the environment. This is really a generalization of (1) above. Instead of the state being the current location and velocity of the robot, we could define the state to be the current location/velocity and the location/velocity of the robot 1 step in the past (or 2, or 3, or $k$ steps). This increases the size of the state space enormously, but if we don't know why there is an error in the robot's movements, it still allows us to model the error. In this case, if we know how much the robot's position changed last time, and we observe that the change is smaller this time (for the same action), we can estimate the rate at which the change is changing, without understanding its cause. A minimal sketch of this history-stacking idea follows this list.
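Here is a minimal sketch of technique (2): wrapping an environment so that the agent's state is the stack of its last $k$ raw observations. The `reset()`/`step()` interface below is an assumption (loosely modelled on the common gym-style API), and `HistoryWrapper` is just an illustrative name, not a standard class.

```python
from collections import deque
import numpy as np

class HistoryWrapper:
    """Turn a k-th order process into an (approximately) Markovian one by
    defining the agent's state as the concatenation of the last k observations."""

    def __init__(self, env, k=2):
        # `env` is assumed to expose reset() -> obs and step(a) -> (obs, reward, done)
        self.env = env
        self.k = k
        self.history = deque(maxlen=k)

    def _state(self):
        # concatenate the stored observations into a single state vector
        return np.concatenate([np.atleast_1d(o) for o in self.history])

    def reset(self):
        obs = self.env.reset()
        # pad the history with copies of the first observation
        for _ in range(self.k):
            self.history.append(obs)
        return self._state()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.history.append(obs)
        # the stacked history, not the raw observation, is what the agent sees
        return self._state(), reward, done
```

Any standard algorithm that assumes a Markovian state can then be run on the wrapped environment unchanged.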
As an example, consider the process of setting the price of a good. The agent's reward is non-Markovian, because sales increase or decline gradually in response to price changes. However, sales don't depend on the entire history of prices; imagine instead that they depend only on the last 5 prices (or the last $k$). We can use technique (2) above to expand the agent's model of what a state is. Now the agent learns that when prices have been $p_{t-1:t-5}$ in the last 5 steps, and it sets the price at time $t$ to some value, its reward is $x$. Since the reward depends only on the current price and the prices in the last 5 steps, the agent is now learning a Markovian process, even though the original process and its reward function are non-Markovian. No changes are made to the reward function or the environment, only to the agent's model of the environment.
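To see how this works in practice, here is a sketch of plain tabular Q-learning on that pricing example. The price levels, the window length, and the `demand` function are all invented for illustration; the point is only that once the state is defined as the tuple of the last 5 prices, the standard Markovian update rule applies unchanged.

```python
import random
from collections import defaultdict

PRICES = [1.0, 1.5, 2.0]   # hypothetical discrete price levels
K = 5                       # length of the price window used as the state

def demand(price_window, new_price):
    # made-up dynamics: sales react to the recent average price, not just the
    # current one, so the raw process is non-Markovian in the current price alone
    avg = sum(price_window) / len(price_window)
    return max(0.0, 10.0 - 3.0 * new_price - 2.0 * (new_price - avg))

Q = defaultdict(float)              # Q[(state, action)] -> estimated value
alpha, gamma, eps = 0.1, 0.95, 0.1
state = (PRICES[0],) * K            # initial window of identical prices

for t in range(50_000):
    # epsilon-greedy choice of the next price
    if random.random() < eps:
        price = random.choice(PRICES)
    else:
        price = max(PRICES, key=lambda a: Q[(state, a)])

    reward = price * demand(state, price)    # revenue this step
    next_state = state[1:] + (price,)        # slide the price window forward

    best_next = max(Q[(next_state, a)] for a in PRICES)
    Q[(state, price)] += alpha * (reward + gamma * best_next - Q[(state, price)])
    state = next_state
```

The table here has at most $3^5 = 243$ states, which also illustrates the cost of technique (2): the state space grows exponentially in the window length $k$.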