I've been trying to train agents to solve a non-Markovian task in a modified version of PettingZoo's Waterworld. In the task, I have two pursuers (the agents I'm training) and three evaders. Without going into too much detail, the task consists of the following: in the first "stage", one pursuer has to catch a specific evader; in the second stage, the second pursuer has to catch a(nother) specific evader; and in the final stage, the two pursuers have to catch the last evader together (i.e., at the same time). I give the agents a small reward every time they complete a stage, plus a larger reward at the end of the task. To make the task non-Markovian, the agents don't have access to the current stage of the task.
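For concreteness, the reward logic is roughly like the sketch below. This is purely illustrative: the stage bookkeeping, the evader/pursuer identities, and the reward values are placeholders, not the actual code of my modified environment.

```python
# Illustrative sketch of the staged reward logic (values and IDs are placeholders).
STAGE_REWARD = 1.0    # small reward per completed stage
FINAL_REWARD = 10.0   # larger reward when the whole task is done


class StagedTaskTracker:
    """Tracks task progress; the stage is never exposed in the agents' observations."""

    def __init__(self):
        self.stage = 0  # 0, 1, or 2

    def step_rewards(self, caught):
        """`caught` maps pursuer id -> evader id caught this step (or None)."""
        rewards = {"pursuer_0": 0.0, "pursuer_1": 0.0}
        if self.stage == 0 and caught.get("pursuer_0") == "evader_0":
            rewards["pursuer_0"] += STAGE_REWARD
            self.stage = 1
        elif self.stage == 1 and caught.get("pursuer_1") == "evader_1":
            rewards["pursuer_1"] += STAGE_REWARD
            self.stage = 2
        elif self.stage == 2 and (
            caught.get("pursuer_0") == "evader_2"
            and caught.get("pursuer_1") == "evader_2"
        ):
            # both pursuers catch the last evader at the same time
            rewards = {a: r + STAGE_REWARD + FINAL_REWARD for a, r in rewards.items()}
        return rewards
```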
I've been training the agents using the Ray RLlib implementation of PPO and, to my surprise, they eventually learn to complete the task. I'll be honest: I wasn't really expecting this. Admittedly, their performance isn't great, but they still manage to complete the task, which I thought wouldn't be possible with (Ray RLlib's) vanilla PPO, since the default policy is a memoryless feed-forward network. I've also tried disabling GAE (after finding a Reddit comment suggesting it might help with non-Markovianity), but the agents still learned to complete the task.
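For reference, the relevant part of the setup looks roughly like the sketch below (RLlib 2.x-style config; the environment name, policy IDs, and hyperparameter values are placeholders rather than my exact settings):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # "staged_waterworld" stands in for my registered multi-agent env
    .environment("staged_waterworld")
    .training(
        use_gae=False,  # the GAE toggle I experimented with (True by default)
        gamma=0.99,     # placeholder hyperparameters
        lr=5e-5,
    )
    .multi_agent(
        # one policy per pursuer; agent IDs follow PettingZoo's naming
        policies={"pursuer_0", "pursuer_1"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id,
    )
)

algo = config.build()
for _ in range(100):
    algo.train()
```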
My question is the following: assuming there are no bugs in my implementation, is there an intuitive explanation for why PPO is able to solve such a task? Is it simply that the task is too easy?