I have the following RL model that I want to train (see the diagram below). My idea is to have two agents, agent A and agent B. Agent A observes the input I1 and selects an action action1; immediately afterwards, agent B observes the input (action1, I2), where I2 = f(action1, I1) is obtained from a known function f(.). Agent B then selects an action action2. The two actions action1 and action2 are used to compute the reward, which is common to both agents. The next input of agent A, I1', is then obtained and the process repeats.
  -------------------------------- next state ------------------------------------
  |                                                                              |
  I1 -> |A| -> action1 -------------------------> |B| -> (action1, action2) -> reward
                 |                                 |
                 |                                 |
                 ----------> f(I1, action1) -> I2 --
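To make the interaction order concrete, here is a minimal Python sketch of one episode of this loop. Everything in it (f, reward_fn, policy_A, policy_B, next_state, and the toy dynamics inside them) is a hypothetical placeholder standing in for the actual components described above, not a specific algorithm:

```python
import numpy as np

# Toy placeholders: f, reward_fn, policy_A, policy_B, and next_state are
# hypothetical names standing in for the real components of the model.

def f(action1, I1):
    """Known function producing agent B's input I2 (placeholder dynamics)."""
    return I1 + action1

def reward_fn(action1, action2):
    """Common reward computed from both actions (placeholder)."""
    return -float(np.sum((action1 - action2) ** 2))

def policy_A(I1):
    """Agent A: observes I1, returns action1 (random placeholder)."""
    return np.random.randn(*I1.shape)

def policy_B(action1, I2):
    """Agent B: observes (action1, I2), returns action2 (random placeholder)."""
    obs = np.concatenate([action1, I2])  # a trained policy would condition on obs
    return np.random.randn(*I2.shape)

def next_state(I1, action1, action2):
    """Transition to agent A's next input I1' (placeholder)."""
    return I1 + 0.1 * action2

I1 = np.zeros(4)
for t in range(5):
    action1 = policy_A(I1)            # A acts first
    I2 = f(action1, I1)               # known function f
    action2 = policy_B(action1, I2)   # B acts second, seeing A's action
    r = reward_fn(action1, action2)   # shared reward for both agents
    I1 = next_state(I1, action1, action2)
    print(f"t={t}  reward={r:.3f}")
```

Note that because B acts immediately after A within the same time step, the state only advances once both actions have been taken.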
I want to know whether it is possible to implement a DRL algorithm for this model or, more importantly, whether a similar two-agent model has already been proposed that I could use.