
I am diving into Multi-Agent Reinforcement Learning (MARL) and, after reading some of the literature, I would like to clarify a few approaches I am not entirely sure about. For the following two cases it is clear to me how they work (a rough sketch of both follows the list):

  • independent learning: each agent has its own distinct policy and receives its own distinct reward (from the same reward function). $$ \pi_{\text{ag1}}: \text{obs}_{\text{ag1}} \rightarrow \text{a}_{\text{ag1}} \rightarrow \text{r}_{\text{ag1}} $$ $$ \pi_{\text{ag2}}: \text{obs}_{\text{ag2}} \rightarrow \text{a}_{\text{ag2}} \rightarrow \text{r}_{\text{ag2}} $$

  • joint learning: one (global) policy making decisions for all agents simultaneously with one global reward $$ \pi_{\text{ag}}: [\text{obs}_{\text{ag1}},\text{obs}_{\text{ag2}},\text{obs}_{\text{ag3}}] \rightarrow [\text{a}_{\text{ag1}},\text{a}_{\text{ag2}},\text{a}_{\text{ag3}}] \rightarrow \text{r}_{\text{glob}}=\text{r}_{\text{ag1}} + \text{r}_{\text{ag2}} + \text{r}_{\text{ag3}} $$
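To make the distinction concrete, here is a rough sketch of how I picture these two structures (the linear "policies", the sizes, and the way observations arrive are just placeholders, not my actual setup):

```python
import numpy as np

OBS_DIM, ACT_DIM, N_AGENTS = 4, 2, 3  # made-up sizes for illustration

# Independent learning: one parameter set (policy) per agent, each mapping
# only that agent's own observation to its own action; each agent is later
# trained on its own reward r_ag_i.
independent_policies = [np.random.randn(ACT_DIM, OBS_DIM) for _ in range(N_AGENTS)]

def independent_step(observations):
    return [W @ obs for W, obs in zip(independent_policies, observations)]

# Joint learning: a single policy over the concatenated observation, emitting
# all actions at once and trained on one global reward r_glob = sum(r_ag_i).
joint_policy = np.random.randn(ACT_DIM * N_AGENTS, OBS_DIM * N_AGENTS)

def joint_step(observations):
    joint_action = joint_policy @ np.concatenate(observations)
    return np.split(joint_action, N_AGENTS)

obs = [np.random.randn(OBS_DIM) for _ in range(N_AGENTS)]
print(independent_step(obs), joint_step(obs))
```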

But with the CTDE approach (centralized training, decentralized execution), I am not quite sure how it should be correctly applied. The following options for policy and reward can be implemented and combined (a sketch of how they fit together under CTDE follows the list):

  1. distinct policies: each agent has its own policy: $\pi_{\text{ag}_i}$
  2. shared policy: all agents use the same policy: $\pi_{\text{ag}}$
  3. distinct reward values for each agent (from the same reward function): $\text{a}_{\text{ag}_i}\rightarrow\text{r}_{\text{ag}_i}$
  4. shared/global reward given to all agents: $\text{a}_{\text{ag}_i}\rightarrow\text{r}_{\text{glob}}=\sum_i \text{r}_{\text{ag}_i}$
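For reference, this is roughly how I picture the CTDE structure that these options plug into: decentralized actors that only see their local observation (one actor per agent for option 1, or a single shared actor for option 2), plus a centralized critic that sees the joint observation during training and is dropped at execution time. Sizes and names are made up:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_AGENTS = 8, 4, 3  # made-up sizes for illustration

class Actor(nn.Module):
    """Decentralized actor: acts on the local observation only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, ACT_DIM))
    def forward(self, local_obs):
        return torch.distributions.Categorical(logits=self.net(local_obs))

class CentralCritic(nn.Module):
    """Centralized critic: sees the joint observation, used only for training."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM * N_AGENTS, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, joint_obs):
        return self.net(joint_obs)

actors = [Actor() for _ in range(N_AGENTS)]   # option 1: distinct policies
# actors = [Actor()] * N_AGENTS               # option 2: one shared policy

critic = CentralCritic()
local_obs = [torch.randn(OBS_DIM) for _ in range(N_AGENTS)]
actions = [actor(obs).sample() for actor, obs in zip(actors, local_obs)]  # execution
value = critic(torch.cat(local_obs))                                      # training
```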

I further know that it is possible to mix a fraction $\alpha$ of the other agents' rewards into the distinct reward from 3), i.e. $\text{r}_{\text{ag}_i} + \alpha \sum_{j \neq i} \text{r}_{\text{ag}_j}$, in order to encourage cooperation.
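In code this mixing would be something like the sketch below; $\alpha = 0$ recovers the fully distinct rewards from 3) and $\alpha = 1$ the shared/global reward from 4):

```python
def mixed_rewards(rewards, alpha=0.5):
    """Blend each agent's own reward with the sum of its teammates' rewards:
    r_ag_i + alpha * sum_{j != i} r_ag_j."""
    total = sum(rewards)
    return [r_i + alpha * (total - r_i) for r_i in rewards]

print(mixed_rewards([1.0, 0.0, -0.5], alpha=0.25))  # -> [0.875, 0.125, -0.25]
```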

But what is the "correct way" of learning coordinated behaviour in MARL with heterogeneous agents, especially when applying CTDE?

My goal is to apply MARL in a game where heterogeneous agents have to attack and destroy opponents. Currently I am using CTDE, where agents of the same type share a policy and receive distinct reward values (from the same reward function). I also tried a shared reward but did not notice a significant difference. I update my policies with PPO, so I am not using a dedicated MARL algorithm. I just want to make sure I am going about this the right way; it is not an implementation issue, but rather a question about the overall concept.
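For context, my current setup looks roughly like the sketch below (type names, sizes, and agent ids are placeholders): agents of the same type share one actor network, each agent keeps its own reward stream, and every shared actor is updated with PPO on the pooled experience of its agents, with a centralized critic as in the earlier sketch:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 8, 4                                   # made-up sizes
AGENT_TYPES = {"melee": ["m1", "m2"], "ranged": ["r1"]}   # hypothetical types/ids

def make_actor():
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))

policies = {t: make_actor() for t in AGENT_TYPES}          # one shared actor per type
policy_of = {aid: t for t, ids in AGENT_TYPES.items() for aid in ids}

def act(all_local_obs):
    """Decentralized execution: every agent acts from its own observation,
    routed through the actor shared by its type."""
    actions = {}
    for aid, obs in all_local_obs.items():
        logits = policies[policy_of[aid]](obs)
        actions[aid] = torch.distributions.Categorical(logits=logits).sample()
    return actions

obs = {aid: torch.randn(OBS_DIM) for ids in AGENT_TYPES.values() for aid in ids}
print(act(obs))
```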
