For questions about the MuZero algorithm proposed in the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (2019).
Questions tagged [muzero]
14 questions
5
votes
2 answers
Is it practical to train AlphaZero or MuZero (for indie games) on a personal computer?
Is it practical/affordable to train an AlphaZero/MuZero engine using a residential gaming PC, or would it take thousands of years of training for the AI to learn enough to challenge humans?
I'm having trouble wrapping my head around how much…
Luke W
- 53
- 3
5
votes
1 answer
Do AlphaZero/MuZero learn faster than humans in terms of the number of games played?
I don't know much about AI and am just curious.
From what I read, AlphaZero/MuZero outperform any human chess player after a few hours of training. I have no idea how many chess games a very talented human chess player on average has played before…
220284
- 153
- 4
3
votes
1 answer
How is MuZero's second binary plane for chess defined?
From the MuZero paper (Appendix E, page 13):
In chess, 8 planes are used to encode the action. The first one-hot plane encodes which position the piece was moved from. The next two planes encode which position the piece was moved to: a one-hot…
MuZeroFm
- 31
- 2
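For the question above, here is a minimal sketch of building the action planes the excerpt describes: a one-hot "from" plane, a one-hot "to" plane, and the second "to" plane whose exact meaning is precisely what the question asks about, so it is only modelled here under an assumption (an on-board/validity indicator). Shapes follow the paper; the function and argument names are mine.

```python
import numpy as np

def encode_move_planes(from_sq, to_sq, to_on_board=True, board_size=8):
    """Sketch of the 8 MuZero action-encoding planes for chess (Appendix E).

    from_sq, to_sq: (row, col) tuples. The meaning of planes[2] is an
    assumption (a binary on-board/validity indicator), which is exactly
    the point of uncertainty in the question.
    """
    planes = np.zeros((8, board_size, board_size), dtype=np.float32)
    planes[0][from_sq] = 1.0                      # one-hot "from" square
    if to_on_board:
        planes[1][to_sq] = 1.0                    # one-hot "to" square
    planes[2][:] = 1.0 if to_on_board else 0.0    # assumed second binary "to" plane
    # Per the paper, the remaining five binary planes indicate the promotion
    # type, if any (queen, knight, bishop, rook, none); omitted in this sketch.
    return planes
```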
3
votes
1 answer
How does MuZero learn to play well for both sides of a two-player game?
I'm coding my own version of MuZero. However, I don't understand how it is supposed to learn to play well for both players in a two-player game.
Take Go for example. If I use a single MCTS to generate an entire game (to be used in the training stage),…
Ziofil
- 128
- 7
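For that question, one common convention (a sketch of the AlphaZero-style bookkeeping, not necessarily MuZero's exact reference pseudocode) is to store every value from the perspective of the player to move and flip its sign at each ply during backup, so a single search tree serves both sides of a zero-sum game:

```python
def backup_two_player(search_path, leaf_value):
    """Back a leaf value up the tree in a zero-sum two-player game.

    Each node's statistics are kept from the perspective of the player to
    move at that node, so the backed-up value is negated at every ply.
    """
    value = leaf_value
    for node in reversed(search_path):
        node.value_sum += value
        node.visit_count += 1
        value = -value  # one ply up, it is the opponent's perspective
```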
2
votes
1 answer
Why does training a NN using MCTS work even if the number of simulations isn't much larger than the number of actions?
tl;dr If the visit rates of children generated by MCTS are biased because not enough samples were taken, why doesn't the network learn random behavior?
My understanding of combining MCTS and NNs (e.g. for MuZero) is that we start with an untrained…
Christopher
- 133
- 4
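For context on that question, the standard AlphaZero/MuZero way of turning root visit counts into a policy training target uses a temperature parameter; with few simulations relative to the number of actions the counts (and hence the target) are noisy, which is the concern raised above. A minimal sketch, with names of my own choosing:

```python
import numpy as np

def visit_counts_to_policy(visit_counts, temperature=1.0):
    """Convert MCTS root visit counts into a policy training target.

    Temperature -> 0 approaches argmax over visit counts; temperature = 1
    uses the counts proportionally.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature == 0:
        policy = (counts == counts.max()).astype(np.float64)
    else:
        policy = counts ** (1.0 / temperature)
    return policy / policy.sum()
```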
2
votes
2 answers
How are NN outputs set up for games that allow multiple actions each turn and have very large sets of possible actions?
I was looking at an AI coding challenge for a two-player game on a 2D grid of variable size (from one game to the next).
Here is a screen shot example of the playfield.
Each player has multiple units on the board. In fact, each tile can hold…
snowfrogdev
- 121
- 3
2
votes
1 answer
In the MuZero paper, how does backprop in the MCTS account for the immediate reward from each edge?
Page 12 of the paper Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model describes how MCTS works for the MuZero algorithm. Equation 4 states that during the 'backup' after a simulation, the mean value (Q) for every…
Matrix001
- 123
- 4
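For reference, the backup that question refers to forms an $n$-step return along the simulated path, bootstrapping from the leaf value and including the model-predicted rewards along the way, and then folds it into a running mean. Reconstructed here from the paper's appendix (quoted from memory, so check against equations 3-4):

```latex
% n-step return along the simulated trajectory, for k = l, ..., 0:
\[
  G^k \;=\; \sum_{\tau=0}^{l-1-k} \gamma^{\tau}\, r_{k+1+\tau} \;+\; \gamma^{\,l-k}\, v_l
\]
% Running-mean update of the edge statistics (equation 4 in the paper):
\[
  Q(s^{k-1}, a^k) \;\leftarrow\; \frac{N(s^{k-1}, a^k)\, Q(s^{k-1}, a^k) + G^k}{N(s^{k-1}, a^k) + 1},
  \qquad
  N(s^{k-1}, a^k) \;\leftarrow\; N(s^{k-1}, a^k) + 1
\]
```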
2
votes
1 answer
How to choose the first action in a Monte Carlo Tree Search?
I'm working on reimplementing the MuZero paper. In the description of the MCTS (page 12), they indicate that a new node with associated state $s$ is to be initialized with $Q(s,a) = 0$, $N(s,a) = 0$ and $P(s,a) = p_a$. From this, I understand that…
Ziofil
- 128
- 7
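For that question, here is a minimal sketch of the node initialization it quotes together with pUCT selection as given in the MuZero appendix (constants $c_1 = 1.25$, $c_2 = 19652$ are from the paper; the class and helper names are mine). Note that before any child has been visited the visit-count terms are zero, so every score ties at $Q = 0$, which is where the ambiguity about choosing the "first" action comes from.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a)
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # used to compute Q(s, a)

    def q(self):
        return 0.0 if self.visit_count == 0 else self.value_sum / self.visit_count

def select_child(children, c1=1.25, c2=19652):
    """pUCT selection over a dict mapping action -> Node (sketch).

    The paper additionally min-max normalizes Q inside the tree; that is
    omitted here for brevity.
    """
    total_visits = sum(child.visit_count for child in children.values())

    def score(child):
        exploration = (child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
                       * (c1 + math.log((total_visits + c2 + 1) / c2)))
        return child.q() + exploration

    # With total_visits == 0 all scores equal Q = 0, so the very first pick is
    # effectively a tie; implementations break it by order or at random.
    return max(children.items(), key=lambda item: score(item[1]))
```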
1
vote
1 answer
Batch-wise inference to speed up MuZero's MCTS
Context:
I've implemented MuZero for the game tic-tac-toe.
Unfortunately, self-play and training are very slow (around 10 hours until it plays quite well).
I ran the Python profiler to find the parts that take the most time.
The result is that…
Lynix
- 33
- 3
1
vote
0 answers
RL agent focuses too much on early rewards, even with no discounting
How can I guide my RL agent to solve tasks in the correct order?
I'm trying to train an agent using reinforcement learning, similar to MuZero. The goal is to solve 4 tasks, A/B/C/D. Each task involves two actions, X1/X2. Initially, only action A1 is…
Christopher
- 133
- 4
1
vote
1 answer
Reproducing AlphaZero/MuZero: Failed to beat initial model in arena
I am trying to reproduce AlphaZero's algorithm on the board game Carcassonne. Since I want to use the final game score differences (i.e. victory point of player 1 - victory point of player 2) as the final and only reward, AlphaZero's UCB score can…
TommyX
- 13
- 3
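One detail worth checking for that question: when the reward is an unbounded score difference, the MuZero paper normalizes Q values inside the search tree to $[0, 1]$ using the minimum and maximum values observed so far, so the UCB/pUCT exploration term stays on a sensible scale. A sketch of that normalization (not a full diagnosis of the arena failure):

```python
class MinMaxStats:
    """Track the min/max value seen in the tree and rescale Q to [0, 1].

    Mirrors the normalization described in the MuZero appendix; without it,
    unbounded score-difference values can swamp the exploration term.
    """
    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, value):
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value):
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value
```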
1
vote
1 answer
Scrabble rack observation with MuZero
Currently I'm trying to implement Scrabble with MuZero.
The $15 \times 15$ game board observation (as input) is of size $27 \times 15 \times 15$ (26 letters + 1 wildcard) with a value of 0 or 1.
However I'm having difficulties finding a suitable way…
Thrusticy
- 11
- 1
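For that question, a sketch of the board half of the observation just to pin down the shapes it states; the rack encoding is the open question, and the second function below is only a hypothetical option (per-letter count planes), labelled as such:

```python
import numpy as np

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ*"  # 26 letters + wildcard = 27 planes

def encode_board(board):
    """board: 15x15 grid of characters ('' for empty squares).

    Returns the 27 x 15 x 15 one-hot observation described in the question.
    """
    planes = np.zeros((len(LETTERS), 15, 15), dtype=np.float32)
    for r in range(15):
        for c in range(15):
            tile = board[r][c]
            if tile:
                planes[LETTERS.index(tile)][r, c] = 1.0
    return planes

def encode_rack_counts(rack):
    """Hypothetical rack encoding: one plane per letter, filled with its count."""
    planes = np.zeros((len(LETTERS), 15, 15), dtype=np.float32)
    for tile in rack:
        planes[LETTERS.index(tile)] += 1.0
    return planes
```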
1
vote
1 answer
In which sense was AlphaGo "just given a rule book"?
I was told that AlphaGo (or some related program) was not explicitly taught even the rules of Go -- if it was "just given the rulebook", what does this mean? Literally, a book written in English to read?
releseabe
- 141
- 1
0
votes
1 answer
Representing scalars as vectors for the network output
In the MuZero paper, Appendix F, they explain that they represent values and rewards as vectors.
This means that the neural networks don't output the scalars directly;
instead, they output a probability distribution that later gets converted…
Lynix
- 33
- 3
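For readers landing on that question: the idea from Appendix F is to map a scalar onto a fixed discrete support, train with a cross-entropy loss, and invert the mapping at inference time by taking an expectation. A minimal two-hot sketch; the 601-atom support and the invertible transform $h$ are my reading of the paper's Atari setup, so double-check the constants:

```python
import numpy as np

SUPPORT = np.arange(-300, 301)  # 601 atoms, as used for the Atari experiments

def h(x, eps=0.001):
    """Invertible value transform from Appendix F, applied before binning."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1) - 1) + eps * x

def scalar_to_distribution(x, support=SUPPORT):
    """Spread a scalar over the two nearest support atoms (two-hot encoding)."""
    x = float(np.clip(x, support[0], support[-1]))
    lo = int(np.floor(x))
    frac = x - lo
    probs = np.zeros(len(support), dtype=np.float32)
    probs[lo - support[0]] = 1.0 - frac
    if frac > 0:
        probs[lo - support[0] + 1] = frac
    return probs

def distribution_to_scalar(probs, support=SUPPORT):
    """Invert the encoding by taking the expectation over the support."""
    return float(np.dot(probs, support))
```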