I'm working with an environment that can easily be subdivided into two parts, where part 1 has an indirect effect on part 2, but I can't simulate either part alone in a realistic way.
Also, both parts of the environment are, on their own, non-Markovian, and both give the agent a reward, with a delay between an action and its observable effect of anywhere between 5 and roughly 20 steps.
Finally, I need the agent to control both parts of the environment, with observations from part 1 having no effect on the agent's actions in part 2, and vice versa.
My current solution is a custom class combining two sb3 SAC agents into one. Instead of using the .learn() method, I have a simple custom learn method that makes both agents predict on the environment, combines the predictions, acts on the environment, and finally gives part 1 rewards to the first agent and part 2 rewards to the second agent.
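To make the structure of that loop concrete, here is a minimal sketch. The SAC agents are replaced with stub classes so the example stays self-contained; in my real code each stub is an sb3 SAC instance, and all names here (StubAgent, CombinedController, env_step) are hypothetical, not sb3 API.

```python
import random

class StubAgent:
    """Stands in for an sb3 SAC agent: predict() returns a random action,
    and a plain list stands in for the replay buffer."""
    def __init__(self, act_dim):
        self.act_dim = act_dim
        self.transitions = []

    def predict(self, obs):
        return [random.uniform(-1.0, 1.0) for _ in range(self.act_dim)]

    def store(self, obs, action, reward, next_obs):
        self.transitions.append((obs, action, reward, next_obs))

class CombinedController:
    """Drives two agents in lock-step, one per environment part."""
    def __init__(self, agent1, agent2):
        self.agent1, self.agent2 = agent1, agent2

    def step(self, obs1, obs2, env_step):
        # Each agent only ever sees its own part of the observation.
        a1 = self.agent1.predict(obs1)
        a2 = self.agent2.predict(obs2)
        # The environment receives the combined (concatenated) action,
        # and returns per-part next observations and rewards.
        (next_obs1, r1), (next_obs2, r2) = env_step(a1 + a2)
        # Part 1 rewards go to the first agent, part 2 rewards to the second.
        self.agent1.store(obs1, a1, r1, next_obs1)
        self.agent2.store(obs2, a2, r2, next_obs2)
        return next_obs1, next_obs2
```

The downside, as described below, is that nothing in this loop ever calls the agents' real .learn(), so all the machinery sb3 hangs off that method is lost.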
This is far from optimal, since I don't really have access to any of the "advanced" sb3 features, like callbacks, because those require calling .learn() on a regular agent. And I can't use .learn(), because that would make the agents learn one after the other instead of both at the same time.
So, my question is: is there a way to customize an sb3 SAC agent's network so that it is actually made of two separate networks, one that only sees observations from part 1, and one that only sees observations from part 2?