Questions tagged [self-play]

For questions about the technique of self-play in reinforcement learning, as applied to combinatorial games and games in general.

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

AlphaGo Zero: Learning from scratch (DeepMind)

Algorithmic Game Theory (wikipedia)

16 questions
6 votes, 2 answers

Why does a self-playing tic-tac-toe agent not become perfect?

I trained a DQN that learns tic-tac-toe by playing against itself with a reward of -1/0/+1 for a loss/draw/win. Every 500 episodes, I test the progress by letting it play some episodes (also 500) against a random player. As shown in the picture…
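A minimal sketch of the kind of evaluation loop described above, assuming a hypothetical tic-tac-toe `env` with `reset()` / `step(action)` / `legal_actions(state)` and a `greedy_action` callable wrapping the trained DQN (these names are illustrative, not taken from the question):

```python
import random

def evaluate_vs_random(env, greedy_action, episodes=500):
    # Pit the greedy DQN policy against a uniformly random opponent.
    # Rewards are assumed to be reported from the agent's perspective:
    # +1 win, 0 draw, -1 loss.
    wins = draws = losses = 0
    for _ in range(episodes):
        state, done, reward = env.reset(), False, 0
        agent_to_move = random.choice([True, False])   # alternate who starts
        while not done:
            if agent_to_move:
                action = greedy_action(state)
            else:
                action = random.choice(env.legal_actions(state))
            state, reward, done = env.step(action)
            agent_to_move = not agent_to_move
        wins += reward > 0
        draws += reward == 0
        losses += reward < 0
    return wins / episodes, draws / episodes, losses / episodes
```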
6 votes, 0 answers

How to correctly implement self-play with DQN?

I have an environment where an agent faces an equal opponent, and while I've achieved OK performance implementing DQN and treating the opponent as a part of the environment, I think performance would improve if the agent trains against itself…
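One common (but by no means the only) way to set this up is to train against a periodically frozen copy of the current network, which keeps the opponent quasi-stationary between refreshes. A sketch under that assumption, with a hypothetical `agent` exposing `act()`, `observe()` and `learn()`:

```python
import copy

def train_self_play(env, agent, total_episodes=50_000, refresh_every=1_000):
    # The learning agent always plays seat 0; a frozen snapshot of itself
    # plays seat 1 and is refreshed every `refresh_every` episodes.
    opponent = copy.deepcopy(agent)
    for episode in range(total_episodes):
        state, done = env.reset(), False
        while not done:
            if env.current_player() == 0:            # learning agent's turn
                action = agent.act(state, explore=True)
                next_state, reward, done = env.step(action)
                agent.observe(state, action, reward, next_state, done)
                agent.learn()
            else:                                    # frozen opponent's turn
                action = opponent.act(state, explore=False)
                next_state, reward, done = env.step(action)
            state = next_state
        if (episode + 1) % refresh_every == 0:
            opponent = copy.deepcopy(agent)          # update the opponent
```

Note that this sketch does not credit the terminal reward back to the agent when the game ends on the opponent's move; the terminal-reward question further down this page deals with exactly that issue.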
6 votes, 0 answers

How exactly does self-play work, and how does it relate to MCTS?

I am working towards using RL to create an AI for a two-player, hidden-information, turn-based board game. I have just finished David Silver's RL course and Denny Britz's coding exercises, and so am relatively familiar with MC control, SARSA,…
4 votes, 1 answer

Are multi-agent or self-play environments always automatically POMDPs?

As part of my thesis, I'm working on a zero-sum game with RL to train an agent. It is a real-time game, a variant of Pong; one could imagine playing Pong with both sides being foosball rods. As I see it, this is an MDP with perfect…
4 votes, 1 answer

In AlphaZero, which policy is saved in the dataset, and how is the move chosen?

I've been doing some research on the principles behind AlphaZero. Especially this cheat sheet (1) and this implementation (2) (in Connect 4) were very useful. Yet, I still have two important questions: How is the policy network updated? In (2),…
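For reference, the AlphaZero paper stores the MCTS visit-count distribution at the root as the policy target and samples the actual move from it, with a temperature that is annealed towards greedy play after the opening. A small sketch of that step:

```python
import numpy as np

def mcts_policy_and_move(root_visit_counts, temperature=1.0):
    # Turn root visit counts N(s, a) into (pi, chosen_move), AlphaZero-style.
    # `pi` is what goes into the training dataset as the policy target;
    # the self-play move is sampled from `pi`. Temperature 0 means greedy.
    counts = np.asarray(root_visit_counts, dtype=np.float64)
    if temperature == 0:
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
    else:
        counts = counts ** (1.0 / temperature)
        pi = counts / counts.sum()
    move = np.random.choice(len(pi), p=pi)
    return pi, move
```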
4 votes, 0 answers

How to deal with nonstationary rewards in asymmetric self-play reinforcement learning?

Suppose we're training two agents to play an asymmetric game from scratch using self-play (like Zerg vs. Protoss in StarCraft). During training, one of the agents can become stronger (discover a good broad strategy, for example) and start winning most…
Dzugaru • 141 • 1
3 votes, 2 answers

How can both agents know the terminal reward in self-play reinforcement learning?

There seems to be a major difference in how the terminal reward is received/handled in self-play RL vs "normal" RL, which confuses me. I implemented TicTacToe the normal way, where a single agent plays against an environment that manages the state…
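A sketch of one way to handle this in self-play: collect the whole game first, then give each player's last transition the terminal outcome from that player's own perspective (the names below are illustrative, not from the question):

```python
def assign_terminal_rewards(episode_transitions, winner):
    # `episode_transitions` is a list of dicts
    #   {"player": 0 or 1, "state": s, "action": a, "next_state": s2}
    # in move order; `winner` is 0, 1 or None (draw). Each player's final
    # transition is marked terminal and receives +1 / -1 / 0 from that
    # player's own perspective; all earlier transitions get reward 0.
    last_index = {}
    for i, tr in enumerate(episode_transitions):
        last_index[tr["player"]] = i
    for i, tr in enumerate(episode_transitions):
        terminal = i == last_index[tr["player"]]
        if not terminal or winner is None:
            reward = 0.0
        else:
            reward = 1.0 if tr["player"] == winner else -1.0
        tr["reward"], tr["done"] = reward, terminal
    return episode_transitions
```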
3 votes, 2 answers

How to deal with instability in self-play?

I'm working on a neural network that plays some board games like reversi or tic-tac-toe (zero-sum games, two players). I'm trying to have one network topology for all the games - I specifically don't want to set any limit for the number of available…
3 votes, 1 answer

How does MuZero learn to play well for both sides of a two-player game?

I'm coding my own version of MuZero. However, I don't understand how it is supposed to learn to play well for both players in a two-player game. Take Go for example. If I use a single MCTS to generate an entire game (to be used in the training stage),…
Ziofil • 128 • 7
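In two-player zero-sum settings, MuZero-style implementations typically keep all values from the perspective of the player to move and flip the sign at every ply during the MCTS backup, so one tree and one network serve both sides. A minimal sketch of that backup (the `Node` class here is a stand-in, not MuZero's actual data structure):

```python
from dataclasses import dataclass

@dataclass
class Node:
    visit_count: int = 0
    value_sum: float = 0.0

def backup(search_path, leaf_value):
    # `search_path` runs from the root to the evaluated leaf; `leaf_value`
    # is the value estimate from the viewpoint of the player to move at the
    # leaf. The sign flips once per ply because the players alternate moves.
    value = leaf_value
    for node in reversed(search_path):
        node.visit_count += 1
        node.value_sum += value
        value = -value
```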
3 votes, 1 answer

Given these two reward functions, what can we say about the optimal Q-values, in self-play tic-tac-toe?

This corresponds to Exercise 1.1 of Sutton & Barto's book (2nd edition), and a discussion followed from this answer. Consider the following two reward functions: (1) Win = +1, Draw = 0, Loss = -1; (2) Win = +1, Draw or Loss = 0. Can we say something about…
2 votes, 2 answers

Generalising the performance of a Q-learning agent through self-play in a two-player game (MCTS?)

I'm using Q-learning (off-policy TD control as specified in Sutton's book, p. 131) to train an agent to play Connect Four. My goal is to create a strong player (superhuman performance?) purely by self-play, without training models against other…
Toekan • 23 • 2
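The update being referenced is the standard Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. In a self-play setting the only subtlety is that $s'$ must be the next state the same player faces, with the reward taken from that player's perspective. A dictionary-based sketch:

```python
def q_update(Q, s, a, r, s_next, legal_next, alpha=0.1, gamma=1.0, terminal=False):
    # One tabular Q-learning update; Q is a dict keyed by (state, action).
    # In self-play, s_next is the next state the same player faces and r is
    # that player's reward (e.g. +1/0/-1 only on its final transition).
    best_next = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in legal_next)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```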
1 vote, 1 answer

How can I pit two AI agents against each other with Keras / TensorFlow?

I am trying to use TensorFlow / Keras to play a text-based game. The game pits two players against each other; they play by answering questions, choosing an answer from among the proposed ones. The game resembles this: a question is asked of player 1, who chooses a value in {0, 1,…
1 vote, 1 answer

Does neural fictitious self-play violate the off-policy theorem?

I was reading the NFSP paper from D. Silver, and I'm somewhat confused by the algorithm: in particular, given that we sample an action according to the best response ($\sigma = \epsilon\text{-greedy}(Q)$), we also insert this transition in…
Alberto • 2,863 • 5 • 12
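For context, NFSP (Heinrich & Silver) keeps two memories: a circular RL buffer of transitions trained off-policy with Q-learning, and a reservoir SL buffer of (state, action) pairs that is filled only when the action came from the best-response policy, so the supervised network learns the average of past best responses. A sketch of that action-selection-and-storage step, with hypothetical `q_policy` / `avg_policy` callables:

```python
import random

def nfsp_select_and_store(state, env, q_policy, avg_policy,
                          rl_memory, sl_memory, eta=0.1):
    # With probability eta follow the best response (epsilon-greedy over Q),
    # otherwise follow the average policy Pi. The transition always goes
    # into the RL memory; the (state, action) pair goes into the SL memory
    # only when the action came from the best response.
    use_best_response = random.random() < eta
    if use_best_response:
        action = q_policy(state)          # epsilon-greedy(Q)
    else:
        action = avg_policy(state)        # sample from the average policy
    next_state, reward, done = env.step(action)
    rl_memory.append((state, action, reward, next_state, done))
    if use_best_response:
        sl_memory.add(state, action)      # reservoir sampling inside .add()
    return next_state, reward, done
```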
1 vote, 1 answer

Reproducing AlphaZero/MuZero: Failed to beat initial model in arena

I am trying to reproduce AlphaZero's algorithm on the board game Carcassonne. Since I want to use the final game score differences (i.e. victory point of player 1 - victory point of player 2) as the final and only reward, AlphaZero's UCB score can…
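For reference, the selection score used by AlphaZero is the PUCT rule $Q(s,a) + c_{\text{puct}}\, P(s,a)\, \sqrt{N(s)} / (1 + N(s,a))$; with win/draw/loss outcomes the Q term stays in $[-1, 1]$, whereas raw victory-point differences do not, which is why MuZero normalises Q with running min/max statistics. A small sketch:

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    # AlphaZero-style PUCT score for one child edge:
    #   Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration
```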
1 vote, 0 answers

What does self-play in reinforcement learning lead to?

Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for…
stoic-santiago • 1,201 • 9 • 22