
In single-agent RL (as in Sutton & Barto's framework), the expected return is defined from the current state $s_t$:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}. $$

In the Bellman equation, $V(s_t)$ depends only on the current state and future rewards:

$$ V(s_t) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma V(s_{t+1}) \mid s_t \right]. $$
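For concreteness, here is a small Python sketch (my own toy example, not from Sutton & Barto) of what I mean by the return depending only on what happens from time $t$ onward: the function below only needs the rewards collected after $s_t$.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    computed from a finite list of rewards observed after time t."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards through the episode
        g = r + gamma * g
    return g

# Example: three rewards observed after state s_t
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```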

However, in Multi-Agent RL (e.g., POSGs), the expected return for agent $i$ is defined over full histories $\hat{h}^t$:

$$ U_i(\pi) = \sum_{\hat{h}^t \in \hat{H}} \Pr(\hat{h}^t \mid \pi) \, u_i(\hat{h}^t), \quad \text{(Eq. 4.2, [MARL Book][1])} $$

where $\hat{h}^t$ includes states, joint actions, and observations up to time $t$.
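To make sure I am reading Eq. 4.2 correctly, here is a rough Monte Carlo sketch of how I picture $U_i(\pi)$ being estimated; `sample_history` and `u_i` are hypothetical placeholders for a joint-policy rollout and agent $i$'s return over a full history, not actual functions from the book.

```python
import random

def estimate_expected_return(sample_history, u_i, num_samples=1000):
    """Approximate the sum over histories of Pr(h | pi) * u_i(h) by sampling
    full joint histories under the joint policy pi and averaging u_i."""
    total = 0.0
    for _ in range(num_samples):
        h = sample_history()  # one full history: states, joint actions, observations
        total += u_i(h)       # agent i's return for that entire history
    return total / num_samples

# Toy usage with dummy placeholders (purely illustrative):
toy_history = lambda: [random.choice([0, 1]) for _ in range(3)]  # stand-in for a joint history
toy_u_i = lambda h: float(sum(h))                                # stand-in for agent i's return
print(estimate_expected_return(toy_history, toy_u_i))
```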

  1. Why does MARL need to explicitly account for *full histories* in expected returns, while single-agent RL can rely solely on the current state?
  2. Is it correct to understand that there is a difference in the time frames used in the two definitions of expected return? (MARL's expected return integrates over the entire history (time steps $0$ to $t$), while single-agent RL only considers future rewards (time $t$ onward).)
