3

I was wondering how we can deal with movement inertia in an environment that is constantly changing.

Imagine that you take a step in an environment that moves a ball. When you take the step, the ball starts moving, and at some point the environment returns an observation and a reward. You then give the observation and the reward to your RL algorithm and compute the next best policy.

While the best policy is being calculated, the ball is still moving in the environment because of the inertia from the action we took previously. How do we deal with that? The policy we just calculated is already based on an old observation, right?

If the RL algorithm and the communication between the agent and the environment are fast, I don't think it is much of an issue, but what if your algorithm is very slow?

CyDevos

2 Answers

6

The simplest answer is that the inertia (or velocity, if the mass of the ball is not a variable) should be part of the observable state that the agent has access to. In RL there are usually two parts to the outcome of an action: one is any immediate reward, and the other is a change to the state that may affect future actions and rewards. Provided the agent can track in its state representation all factors that could impact future rewards, most RL algorithms will work as intended (some will be better than others depending on which parts of the problem are hard to solve).
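For example, a minimal sketch of what the observation could look like if the simulator exposes the ball's velocity directly (the `ball` attributes here are hypothetical, just to illustrate what should be in the state):

```python
import numpy as np

# Hypothetical simulator object, only to illustrate what "observable state"
# should contain: position alone is not enough, velocity must be in there too.
def observe(ball):
    return np.array([
        ball.x, ball.y,    # where the ball is now
        ball.vx, ball.vy,  # how it is moving right now (its "inertia")
    ], dtype=np.float32)
```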

A common hack for representing velocity is to "frame stack" several recent static observations and treat them together as the current state. This was used, for example, in the original paper that applied DQN to Atari games.
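A rough sketch of such a wrapper (not tied to any particular library; the default of k=4 just mirrors the usual Atari setup):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last k raw observations and return them stacked as the state,
    so the agent can infer velocity from differences between frames."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with copies of the first frame, as in the DQN setup.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self.state()

    def step(self, obs):
        self.frames.append(obs)
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=0)  # shape: (k, *obs_shape)
```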

If the RL algorithm and the communication between the agent and the environment are fast, I don't think it is much of an issue, but what if your algorithm is very slow?

The basic MDP model used in RL is designed around a situation where you have some control over what the time steps are. If training is done in a real physical environment, and you are not able to semi-reliably sample observations and act with a fixed cadence (e.g. ten times per second), then you may have problems. A lot of game-based RL ignores this issue, runs faster than real time anyway, and only allows the game emulator to move forward in time when it is ready to act and collect more training data.
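As a sketch, a fixed-cadence control loop for the "ten times per second" case might look like this (observe, policy and act are hypothetical stand-ins for your sensor read, agent inference and actuator command):

```python
import time

CONTROL_PERIOD = 0.1  # ten decisions per second

def control_loop(observe, policy, act):
    next_tick = time.monotonic()
    while True:
        obs = observe()          # sample the environment "now"
        action = policy(obs)     # must finish well within the period
        act(action)
        next_tick += CONTROL_PERIOD
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)    # hold the cadence
        else:
            next_tick = time.monotonic()  # overran the period; resynchronise
```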

If you really are in a real-time environment, outside of a simulator, where events happen too fast to reliably train, then you may have issues. One possible solution to look into is real-time RL (link chosen randomly, but hopefully it gives you some leads), where time steps can be more flexible, although if your agent is really slow compared to the use case, results will be limited by response time, potentially making optimal behaviour infeasible. You can also use an agent that only implements the current behaviour policy and run training updates asynchronously on a separate device based on the collected results; then you are not limited by training speed, only by the inference speed that a production-ready agent would be restricted by anyway.
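A sketch of that last idea, splitting acting and learning into separate loops (`policy`, `learner`, `observe` and `act` are hypothetical stand-ins, not any specific library's API):

```python
import threading
import queue

def run_async(policy, learner, observe, act, batch_size=32):
    experience = queue.Queue(maxsize=10_000)

    def actor_loop():
        while True:
            obs = observe()
            action = policy.select_action(obs)  # fast inference only
            act(action)
            experience.put((obs, action))       # hand data to the learner

    def learner_loop():
        while True:
            batch = [experience.get() for _ in range(batch_size)]
            new_weights = learner.update(batch)  # slow gradient step
            policy.load_weights(new_weights)     # refresh the acting policy

    threading.Thread(target=actor_loop, daemon=True).start()
    threading.Thread(target=learner_loop, daemon=True).start()
```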

Neil Slater
4

In RL, a time-discretized MDP is usually used for continuous-time problems, with the idea that if we evaluate the policy often enough, we can approximate the continuous case. This is also motivated by the fact that we have no good way to handle truly continuous scenarios even in software; even videos are just a fast-enough sequence of images.

At this point, if your policy is fast enough (i.e. not an LLM that takes minutes to generate an answer), you can simply evaluate it at every timestep (with images, at most at the video framerate).

You could argue that during training you can wait for the policy evaluation before moving to the next frame, whereas in real life you cannot, but at that point it is up to you to develop a faster model.

In any case, you can start by developing a model, test it in deployment to check its performance, measure how long it takes to make a decision, and use that timing in the environment to improve the training.
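For instance, a small sketch of measuring the decision latency so it can be reused as the simulation timestep (`policy`, `sample_observation` and `BallEnvironment` are hypothetical names):

```python
import time
import statistics

def measure_decision_time(policy, sample_observation, n_trials=100):
    latencies = []
    for _ in range(n_trials):
        obs = sample_observation()
        start = time.perf_counter()
        policy(obs)                                    # one decision
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

# dt = measure_decision_time(policy, sample_observation)
# env = BallEnvironment(timestep=dt)  # hypothetical env that advances dt per action
```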

Regarding inertia, you have another problem: the first moment (velocity) cannot be estimated from a single point, so you will have to provide your model with at least the previous state so that it can estimate it.
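Concretely, something like this (a finite-difference sketch, assuming observations are positions sampled every `dt` seconds):

```python
import numpy as np

def state_with_velocity(current_pos, previous_pos, dt):
    """Augment the current position with a velocity estimate; it needs at
    least two samples, since a single point cannot give the first moment."""
    current = np.asarray(current_pos, dtype=float)
    previous = np.asarray(previous_pos, dtype=float)
    velocity = (current - previous) / dt
    return np.concatenate([current, velocity])

# Example: the ball moved from (0.0, 0.0) to (0.1, 0.05) in 0.1 s
# -> state [0.1, 0.05, 1.0, 0.5]
```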

Alberto