
I've implemented A2C. I'm now wondering why we would have multiple actors walking around the environment and gathering rewards. Why not just have a single agent run in a vector of environments?

I personally think this would be more efficient, since all actions can now be computed together with a single pass through the network. I've run some tests, and this seems to work fine. One reason I can think of to use multiple actors is implementing the algorithm across many machines, in which case we can have one agent per machine. What other reasons are there to prefer multiple actors?

As an example, here is a vectorized environment based on OpenAI's Gym:

import gym

class GymEnvVec:
    """Wraps several copies of a Gym environment and steps them in lockstep."""

    def __init__(self, name, n_envs, seed):
        self.envs = [gym.make(name) for _ in range(n_envs)]
        for i, env in enumerate(self.envs):
            env.seed(seed + 10 * i)  # give each copy a different seed

    def reset(self):
        # Reset every environment and return the list of initial observations.
        return [env.reset() for env in self.envs]

    def step(self, actions):
        # Step each env with its own action and regroup the results into
        # (observations, rewards, dones, infos) tuples across environments.
        return list(zip(*[env.step(a) for env, a in zip(self.envs, actions)]))
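
For instance (this is just an illustrative sketch of what I mean, assuming the classic Gym API used above and a small PyTorch policy network for CartPole-v1, which are not part of the class itself), a single forward pass can produce one action per environment:

import numpy as np
import torch
import torch.nn as nn

# Illustrative policy: any module mapping a batch of observations to action logits would do.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

vec_env = GymEnvVec("CartPole-v1", n_envs=8, seed=0)
states = np.asarray(vec_env.reset())                                 # shape (n_envs, 4)
logits = policy_net(torch.as_tensor(states, dtype=torch.float32))    # one forward pass for all envs
actions = torch.distributions.Categorical(logits=logits).sample()    # one action per environment
next_states, rewards, dones, infos = vec_env.step(actions.tolist())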
Maybe

2 Answers


I believe that if you run a single agent in multiple parallel environments, you will often take similar actions in similar states. The point of multiple agents is that each one can have different parameters and, possibly, a different explicit exploration policy, so exploration improves and you learn more from the environment (you see more of the state space).

With a single agent you can't really achieve that: you have a single exploration policy and a single parameter set, and after a while you would mostly be seeing similar states. You would still speed up learning, but only because you are running multiple environments in parallel (compared to regular actor-critic or Q-learning). I think the quality of learning is better with multiple different actors.
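
As a rough sketch of one way to get that kind of per-actor exploration (my own illustration, not something prescribed by A2C), you could give each actor its own exploration hyperparameter, e.g. a different softmax temperature:

import numpy as np

n_actors = 8
temperatures = np.linspace(0.5, 2.0, n_actors)   # each actor gets its own temperature

def select_action(logits, actor_id, rng=np.random):
    # logits: raw action scores from that actor's copy of the policy network
    scaled = np.asarray(logits) / temperatures[actor_id]
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# The same logits lead to different exploration behaviour for different actors:
print(select_action([1.0, 2.0, 0.5], actor_id=0), select_action([1.0, 2.0, 0.5], actor_id=7))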

Brale

When you run multiple actors, you actually make copies of your current agent and run each copy in its own ordinary, non-vectorized environment. The environments are usually identical as well.
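
In code, that setup looks roughly like the sketch below (the names RandomPolicy and collect_rollout are just stand-ins of mine, assuming the classic Gym API): each actor is a copy of the same agent stepping its own plain environment.

import copy
import gym
import numpy as np

class RandomPolicy:
    # Stand-in for the real actor network; only an .act(obs) method matters here.
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def act(self, obs):
        return np.random.randint(self.n_actions)

def collect_rollout(policy, env, n_steps):
    # One actor's copy of the policy stepping its own ordinary (non-vector) env.
    obs, trajectory = env.reset(), []
    for _ in range(n_steps):
        action = policy.act(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs
    return trajectory

shared_policy = RandomPolicy(n_actions=2)
# "Multiple actors" here just means copies of the same agent, each with its own plain env.
actors = [(copy.deepcopy(shared_policy), gym.make("CartPole-v1")) for _ in range(4)]
rollouts = [collect_rollout(p, e, n_steps=32) for p, e in actors]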

In the case of Atari, yes, of course, using a vectorized environment is much more efficient. But if you want to train an agent to play DotA, or to drive a car, then there is no vectorized environment.

Running different actors, instead of copies of the same agent, is not really possible with on-policy algorithms: you have to collect data with one and the same policy, otherwise it is not on-policy.

A slight exception to this is asynchronous training. In this case you start with identical copies of your agent, update these copies asynchronously, and only once in a while sync the policies. However, you still have to make sure the actors' policies don't deviate too much from one another. Note that the synchronous approach may perform better than the asynchronous one, and again, if you can vectorize your environment, I don't really see the need to do this.
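
As a toy illustration of that asynchronous pattern (loosely A3C-style; the "gradient" here is just a random placeholder, and the parameter vector is a plain array), each worker updates shared parameters and only periodically re-syncs its local copy:

import threading
import numpy as np

shared = {"weights": np.zeros(10)}   # toy stand-in for the shared policy parameters
lock = threading.Lock()

def worker(sync_every=20, n_steps=100, lr=0.1):
    local_weights = shared["weights"].copy()
    for step in range(n_steps):
        if step % sync_every == 0:
            with lock:
                local_weights = shared["weights"].copy()   # periodically re-sync with the shared policy
        grad = np.random.randn(10) * 0.01                  # placeholder for a real policy gradient
        local_weights -= lr * grad                         # local (possibly stale) update
        with lock:
            shared["weights"] -= lr * grad                 # asynchronous update of the shared copy

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()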

pi-tau