
I am starting work on a project where we want to optimize the production of a chemical unit through a reinforcement learning approach. From the SMEs, we have already obtained simulator code that takes some inputs and returns the outputs. Part of the output is the objective function that we want to maximize by tuning the input variables. From a reinforcement learning angle, the inputs will be the agent's actions, while the state and reward can be derived from the output. We are currently building an RL environment, the major part of which is the simulator code described above.
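For concreteness, here is a minimal sketch of what such an environment could look like, assuming the Gymnasium API and a hypothetical `run_simulator(inputs)` stand-in for the SME code; all names, shapes, and the position of the objective in the output are illustrative assumptions, not the actual interface:

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    def run_simulator(inputs):
        # Hypothetical stand-in for the SME simulator: returns a dummy
        # output vector whose first component is the objective.
        return np.concatenate([[-np.sum((inputs - 0.3) ** 2)], np.zeros(9)])

    class ChemicalUnitEnv(gym.Env):
        """One-step environment: the action is the full vector of input
        variables; the reward is the objective read off the output."""

        def __init__(self, n_inputs=5, n_outputs=10):
            self.action_space = spaces.Box(-1.0, 1.0, shape=(n_inputs,))
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n_outputs,))

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            # A one-step process has no meaningful initial state; a fixed
            # dummy observation serves as the single "context".
            return np.zeros(self.observation_space.shape, dtype=np.float32), {}

        def step(self, action):
            output = run_simulator(np.asarray(action))
            obs = output.astype(np.float32)
            reward = float(output[0])            # objective assumed at index 0
            return obs, reward, True, False, {}  # episode terminates immediately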

We were talking to an RL expert, and she pointed out that one thing we have conceptually wrong is that our environment will not have the Markov property: it is really a one-step process, with the process not continuing from the previous state, so there is no continuity in state transitions. She is correct. This made me wonder how we can get around this. Can we perhaps append some part of the current state to the next state, etc.? (A sketch of that idea follows below.) More importantly, I have seen RL applied to optimal control in other examples that are also non-Markovian, e.g. scheduling, TSP problems, and process optimization. What is the explanation in such cases? Does one simply assume the process to be Markovian with an unknown transition function?
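To illustrate the "append some part of the current state" idea, here is a rough sketch of an observation-stacking wrapper. It assumes a Box observation space and is purely illustrative; note that it does not by itself make a one-step process Markovian:

    import numpy as np
    import gymnasium as gym

    class AppendPrevObs(gym.Wrapper):
        """Concatenates the previous observation onto the current one,
        a common trick for restoring approximate Markovness."""

        def __init__(self, env):
            super().__init__(env)
            low = np.concatenate([env.observation_space.low] * 2)
            high = np.concatenate([env.observation_space.high] * 2)
            self.observation_space = gym.spaces.Box(low=low, high=high,
                                                    dtype=np.float32)

        def reset(self, **kwargs):
            obs, info = self.env.reset(**kwargs)
            self._prev = np.zeros_like(obs)
            return np.concatenate([obs, self._prev]), info

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            stacked = np.concatenate([obs, self._prev])
            self._prev = obs
            return stacked, reward, terminated, truncated, info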

desertnaut
chupa_kabra

2 Answers


RL is currently being applied to environments that are definitely not Markovian; at best they are weakly Markovian, with a dependency on past states that decreases over time.

You need to provide details of your problem. If it is a one-step process, then any optimization method can be used.
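For example (a sketch, reusing the hypothetical `run_simulator` stand-in from the question), a one-step problem can be handed directly to a black-box optimizer such as SciPy's differential evolution:

    import numpy as np
    from scipy.optimize import differential_evolution

    def run_simulator(inputs):
        # Hypothetical stand-in for the SME simulator code.
        return np.concatenate([[-np.sum((inputs - 0.3) ** 2)], np.zeros(9)])

    def neg_objective(x):
        # SciPy minimizes, so negate the objective we want to maximize.
        return -run_simulator(x)[0]

    bounds = [(-1.0, 1.0)] * 5  # one (low, high) pair per input variable
    result = differential_evolution(neg_objective, bounds, seed=0)
    print("best inputs:", result.x)
    print("best objective:", -result.fun)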

FourierFlux

I view it as a generalization of the conditional Markovian case. It does have the Markov property, in that the future state depends solely on the input (action) at the given state, which presumably is sampled from a stochastic policy conditioned on the current state. Since each episode is a single transition, there is no history for the dynamics to depend on, so the Markov property holds trivially.

It seems to me to be a more general, simpler, and less constrained case.

Jakob G