
Say I have a simple multi-agent reinforcement learning problem using vanilla policy gradient methods (i.e. REINFORCE) that is currently running with one network per agent. If I can say that my agents:

  • are all of the same class
  • have ~equivalent environmental contexts (on average)
  • have no privileged state relative to other agents
  • perform updates in the same way as all other agents
  • DO use LSTMs (but store and reset memory states separately)
  • DO receive rewards for individual actions based on their individual states

...is it possible to use one network for all agents so as to minimize training time? And if so, how do I combine rewards and generate losses? For instance, if I calculate the loss as -logprob * reward (in the REINFORCE case), could I simply sum or average it over all agents and then backprop accordingly?
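For concreteness, here is roughly what my current per-agent update looks like (a simplified PyTorch sketch with made-up dimensions; the network and variable names are just illustrative):

```python
import torch
import torch.nn as nn

# Illustrative per-agent policy: LSTM over observations, linear head to action logits.
class AgentPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden

# One network and one optimizer per agent (the setup I have now).
n_agents, obs_dim, n_actions, T = 4, 8, 3, 20
policies = [AgentPolicy(obs_dim, n_actions) for _ in range(n_agents)]
optims = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in policies]

for policy, opt in zip(policies, optims):
    obs = torch.randn(1, T, obs_dim)        # this agent's episode observations
    actions = torch.randint(n_actions, (1, T))
    returns = torch.randn(1, T)             # this agent's per-step returns (placeholder values)
    logits, _ = policy(obs)                 # fresh LSTM state at episode start
    logp = torch.log_softmax(logits, dim=-1)
    logp_taken = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    loss = -(logp_taken * returns).sum()    # REINFORCE: -logprob * return
    opt.zero_grad()
    loss.backward()
    opt.step()
```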

1 Answer

Yes, this can be done and is widely applied in the recent literature on multi-agent RL, at least in the collaborative setting where agents optimize a shared reward. This is also known as parameter sharing between agents. Note that, even if the agents share the parameters of their policy networks, you can still get different behavior per agent if the agents receive different inputs, e.g., perceive different observations or have different LSTM states.

In terms of the loss function, no changes are required compared to the case where each agent has its own network with private parameters, as long as you make sure the network with shared parameters correctly processes the individual inputs of each agent.
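As a rough sketch of what that looks like (PyTorch assumed; the architecture, dimensions, and names below are illustrative, not a reference implementation), you run each agent's trajectory through the same network, accumulate the per-agent REINFORCE losses, and do a single backward pass:

```python
import torch
import torch.nn as nn

# Same kind of recurrent policy as before, but instantiated once and shared by all agents.
class SharedPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden

n_agents, obs_dim, n_actions, T = 4, 8, 3, 20
policy = SharedPolicy(obs_dim, n_actions)            # one set of parameters for everyone
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

total_loss = torch.zeros(())
for _ in range(n_agents):
    obs = torch.randn(1, T, obs_dim)                 # this agent's own observations
    actions = torch.randint(n_actions, (1, T))
    returns = torch.randn(1, T)                      # this agent's own returns (placeholder values)
    logits, _ = policy(obs)                          # LSTM state is kept per agent, reset each episode
    logp = torch.log_softmax(logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    total_loss = total_loss - (logp * returns).sum() # per-agent REINFORCE loss, accumulated

opt.zero_grad()
(total_loss / n_agents).backward()                   # summing would also work
opt.step()
```

Summing versus averaging over agents only rescales the gradient, so in practice it mostly amounts to a different effective learning rate.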

A concrete example is mentioned here - see Appendix C. They also use the trick of appending a one-hot encoding of the agent index to the observation, which enables learning policies that differ based on agent identity. Or, if you prefer to see an implementation of the idea in code, check out how agents are defined in this repository.
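For the one-hot trick specifically, a minimal sketch (names and shapes are illustrative) is to concatenate the agent's identity to each observation before it is fed to the shared policy:

```python
import torch

def add_agent_id(obs_seq, agent_idx, n_agents):
    """Append a one-hot agent identifier to every step of a (B, T, obs_dim) observation sequence."""
    one_hot = torch.zeros(*obs_seq.shape[:2], n_agents)
    one_hot[..., agent_idx] = 1.0
    return torch.cat([obs_seq, one_hot], dim=-1)     # shape (B, T, obs_dim + n_agents)

obs = torch.randn(1, 20, 8)                          # one agent's observation sequence
obs_with_id = add_agent_id(obs, agent_idx=2, n_agents=4)
# obs_with_id now has 12 features per step; the shared policy's input size grows accordingly.
```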
