
I watched the video of the recent matches between AlphaStar and professional StarCraft II players, and during the discussion David Silver of DeepMind said that they train AlphaStar on TPUs.

My question is: how is it possible to utilise a GPU or TPU for reinforcement learning when the agent needs to interact with an environment, in this case the StarCraft II game engine?

At the moment I have to train my RL agent on the CPU, but obviously I'd love to utilise the GPU to speed it up. Does anyone know how they did it?

Here's the part where they talk about it, if anyone is interested:

https://www.youtube.com/watch?v=cUTMhmVh1qs&t=7030s

BigBadMe
  • My first guess would be that they run the Starcraft engine on that TPU. – John Dvorak Feb 01 '19 at 10:25
  • As I understand it, a TPU isn't a self-contained computer that could load and run an environment like StarCraft? Maybe it can, I don't know. I thought it was essentially like a GPU architecture but with a higher number of cores. I just can't think how you could put a game environment on it. – BigBadMe Feb 01 '19 at 10:56
  • I do know you can implement a full-blown raytracer in GPU code (specifically, CUDA). As for utilizing parallelization, Blender renders each pixel in a tile at the same time, and DeepMind should be able to use the same trick - each core running one instance of the game - if they use neural evolution rather than gradient descent - and I have no idea how you would gradient-descend a Starcraft AI. Just guesswork on my side though, I don't know how TPUs actually work. – John Dvorak Feb 01 '19 at 14:03
  • TPUs are hardware accelerators. There is no way to "run" a game on an accelerator, which, as @BigBadMe pointed out, is not a computer. The game would need to be heavily customised to make use of TPUs as a form of accelerator - this is most certainly not the case. – MasterScrat Feb 27 '20 at 14:08
  • Starcraft is not differentiable, so you can't learn to play using gradient descent directly, which is why they use reinforcement learning. – MasterScrat Feb 27 '20 at 14:13

1 Answer


In their blog post, they link to (among many other papers) their IMPALA paper. Now, the blog post only links to that paper with text implying that they're using the "off-policy actor-critic reinforcement learning" described in that paper, but one of the major points of the IMPALA paper is actually an efficient, large-scale, distributed RL setup.

So, until we get more details (for example in their paper that's currently under review), our best guess would be that they're also using a similar kind of distributed RL setup as described in the IMPALA paper. As depicted in Figures 1 and 2, they decouple actors (machines running code to generate experience, e.g. by playing StarCraft) and learners (machines running code to learn/train/update weights of neural network(s)).
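To make that decoupling concrete, here is a minimal sketch of an actor/learner split in the spirit of IMPALA. Everything in it is an illustrative assumption rather than AlphaStar's (or IMPALA's) actual code: the tiny network, the placeholder environment, the REINFORCE-style loss, and the thread/queue plumbing are all stand-ins, and IMPALA's V-trace off-policy correction is omitted entirely.

```python
# Illustrative actor/learner decoupling in the spirit of IMPALA (not DeepMind's code).
import queue
import threading

import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Stand-in for AlphaStar's network: maps an observation to action logits."""

    def __init__(self, obs_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)


def actor(policy, trajectory_queue, n_episodes=10, steps_per_episode=20):
    """Runs the (placeholder) environment on CPU; only forward passes, no gradients."""
    for _ in range(n_episodes):
        obs_buf, act_buf, rew_buf = [], [], []
        obs = torch.randn(8)                       # placeholder for a real game observation
        for _ in range(steps_per_episode):
            with torch.no_grad():                  # actors never build an autograd graph
                logits = policy(obs)
                action = torch.distributions.Categorical(logits=logits).sample()
            reward = torch.randn(())               # placeholder reward from the environment
            obs_buf.append(obs)
            act_buf.append(action)
            rew_buf.append(reward)
            obs = torch.randn(8)                   # placeholder next observation
        trajectory_queue.put(
            (torch.stack(obs_buf), torch.stack(act_buf), torch.stack(rew_buf))
        )


def learner(policy, trajectory_queue, n_updates=20, lr=1e-3):
    """Consumes trajectories in batches and updates the weights (the accelerator-heavy part)."""
    optim = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_updates):
        obs, actions, rewards = trajectory_queue.get()
        logits = policy(obs)
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        loss = -(log_probs * rewards).mean()       # naive REINFORCE-style loss, not V-trace
        optim.zero_grad()
        loss.backward()                            # backward passes happen only in the learner
        optim.step()


if __name__ == "__main__":
    policy = TinyPolicy()
    trajectories = queue.Queue(maxsize=16)
    actors = [threading.Thread(target=actor, args=(policy, trajectories)) for _ in range(2)]
    for t in actors:
        t.start()
    learner(policy, trajectories)                  # 2 actors x 10 episodes = 20 updates
    for t in actors:
        t.join()
```

The point is just the structure: the actors produce experience without ever computing gradients, while a separate learner consumes it in batches, and only the learner's half of the work is the kind of dense matrix arithmetic that TPUs accelerate. In a real distributed setup the actors would also hold their own copies of the parameters and sync them periodically, rather than sharing one in-process network as above.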

I would assume that their TPUs are being used by the learner (or, likely, multiple learners). StarCraft II itself won't benefit from running on TPUs (and probably would be impossible to even get running on them in the first place), because the game logic likely doesn't depend on large-scale, dense matrix operations (the kinds of operations that TPUs are optimized for). So, the StarCraft II game itself (which only needs to run for the "actors", not for the "learners") is almost certainly running on CPUs.

The actors still have to run forward passes through neural networks in order to select actions. I would assume that their actors are also equipped with either GPUs or TPUs to do this more quickly than a CPU could, but the more expensive backward passes are not necessary there; only the learners need to perform those.
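For concreteness, here is a hedged sketch of what "forward passes only" could look like on an actor: the network sits on whatever accelerator happens to be available, while the game keeps stepping on the CPU. The device selection and stand-in policy below are assumptions for illustration, not AlphaStar's actual setup (and a TPU would be driven through XLA rather than CUDA).

```python
# Illustrative only: an actor's inference runs on an accelerator if one is present,
# while the game/environment itself keeps running on the CPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"   # a TPU would be used via XLA instead
policy = torch.nn.Linear(8, 4).to(device)                 # stand-in for the real policy network

obs = torch.randn(1, 8)                                    # observation from the CPU-side game
with torch.inference_mode():                               # forward pass only, no autograd bookkeeping
    logits = policy(obs.to(device))
    action = torch.argmax(logits, dim=-1).item()
# `action` would then be sent back to the CPU-side game engine.
```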

Dennis Soemers