I trained a DQN that learns tic-tac-toe by playing against itself, with a reward of -1/0/+1 for a loss/draw/win. Every 500 training episodes, I test its progress by letting it play 500 evaluation episodes against a random player.
As shown in the picture…
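For reference, a minimal sketch of the kind of periodic evaluation described above: the learned policy plays a batch of games against a uniformly random opponent. The board helpers and the `policy` callable are illustrative stand-ins, not the asker's code.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def play_vs_random(policy, agent_mark=+1):
    """Play one game; the agent moves via `policy`, the opponent at random."""
    board, to_move = [0] * 9, +1
    while True:
        w = winner(board)
        legal = [i for i in range(9) if board[i] == 0]
        if w != 0 or not legal:
            return 1 if w == agent_mark else (-1 if w != 0 else 0)
        move = policy(board, legal) if to_move == agent_mark else random.choice(legal)
        board[move] = to_move
        to_move = -to_move

def evaluate(policy, n_games=500):
    """Win/draw/loss fractions over n_games, mirroring the 500-game test phase."""
    results = [play_vs_random(policy) for _ in range(n_games)]
    return {"win": results.count(1) / n_games,
            "draw": results.count(0) / n_games,
            "loss": results.count(-1) / n_games}

# Example call with a random stand-in for the trained DQN's greedy policy:
print(evaluate(lambda board, legal: random.choice(legal)))
```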
I have an environment where an agent faces an equal opponent, and while I've achieved OK performance implementing DQN and treating the opponent as a part of the environment, I think performance would improve if the agent trains against itself…
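For illustration, the structural change usually amounts to something like the skeleton below, where the same agent picks the move for whichever player is to act, instead of the opponent living inside the environment's step function. `env`, `agent.act` and `env.current_player` are assumed interfaces, not a particular library.

```python
def self_play_episode(env, agent):
    """Generate one episode in which the same agent controls both seats."""
    state = env.reset()
    transitions = {+1: [], -1: []}             # experience kept per player
    done = False
    while not done:
        player = env.current_player            # +1 or -1, whoever is to move
        action = agent.act(state, player)      # one network decides for both seats
        next_state, reward, done = env.step(action)
        transitions[player].append((state, action, reward, next_state, done))
        state = next_state
    return transitions                          # both halves feed one replay buffer
```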
I am working towards using RL to create an AI for a two-player, hidden-information, turn-based board game. I have just finished David Silver's RL course and Denny Britz's coding exercises, and so am relatively familiar with MC control, SARSA,…
As part of my thesis, I'm working on a zero-sum game, using RL to train an agent.
The game is a real-time derivative of Pong; one could imagine playing Pong with both sides being foosball rods.
As I see it, this is an MDP with perfect…
I've been doing some research on the principles behind AlphaZero.
This cheat sheet (1) and this implementation (2) (for Connect 4) were especially useful.
Yet, I still have two important questions:
How is the policy network updated? In (2),…
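For reference, the update described in the AlphaZero paper: the network $(\mathbf{p}, v) = f_\theta(s)$ is trained so that the policy head matches the MCTS visit-count distribution $\boldsymbol{\pi}$ at the root (the search-improved policy) and the value head matches the eventual game outcome $z \in \{-1, 0, +1\}$, by minimising

$$\ell = (z - v)^2 \;-\; \boldsymbol{\pi}^\top \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^2 .$$

How implementation (2) realises this may differ in detail; the loss above is only the published formulation.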
Suppose we're training two agents to play an asymmetric game from scratch using self-play (like Zerg vs. Protoss in Starcraft). During training, one of the agents can become stronger (for example, by discovering a good broad strategy) and start winning most…
There seems to be a major difference in how the terminal reward is received/handled in self-play RL vs "normal" RL, which confuses me.
I implemented TicTacToe the normal way, where a single agent plays against an environment that manages the state…
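One common convention (a sketch of a typical bookkeeping scheme, not necessarily what the asker implemented) is that only the terminal step carries a reward, and each player's last transition receives it from that player's point of view.

```python
def assign_terminal_rewards(moves, winner):
    """
    moves  : list of (player, state, action, next_state) in the order played
    winner : +1, -1, or 0 for a draw
    returns: list of (player, state, action, reward, next_state, done) tuples
    """
    players = {m[0] for m in moves}
    last_index = {p: max(i for i, m in enumerate(moves) if m[0] == p) for p in players}
    transitions = []
    for i, (player, state, action, next_state) in enumerate(moves):
        done = (i == last_index[player])
        reward = 0.0
        if done and winner != 0:
            reward = 1.0 if player == winner else -1.0
        # Caveat for bootstrapping methods: the loser's recorded next_state should
        # be the true terminal board (after the winner's final move), otherwise the
        # -1 appears to come from a non-terminal position.
        transitions.append((player, state, action, reward, next_state, done))
    return transitions
```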
I'm working on a neural network that plays some board games like reversi or tic-tac-toe (zero-sum games, two players). I'm trying to have one network topology for all the games - I specifically don't want to set any limit on the number of available…
I'm coding my own version of MuZero. However, I don't understand how it is supposed to learn to play well for both players in a two-player game.
Take Go for example. If I use a single MCTS to generate an entire game (to be used in the training stage),…
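A common way a single tree and network can serve both players in a zero-sum game is a negamax-style backup: values are always stored from the perspective of the player to move at each node, and negated once per ply on the way back up. The sketch below illustrates only that convention (the `Node` class is a placeholder, not MuZero's actual data structure).

```python
class Node:
    """Placeholder search node; only the statistics needed for the backup."""
    def __init__(self):
        self.visit_count = 0
        self.value_sum = 0.0

def backup(search_path, leaf_value):
    """
    search_path : nodes from the root to the evaluated leaf (alternating players)
    leaf_value  : value of the leaf from the perspective of the player to move there
    """
    value = leaf_value
    for node in reversed(search_path):
        node.visit_count += 1
        node.value_sum += value   # stored from this node's player-to-move perspective
        value = -value            # flip sign once per ply for the opponent above
```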
This corresponds to Exercise 1.1 of Sutton & Barto's book (2nd edition), and a discussion followed from this answer.
Consider the following two reward functions:
1. Win = +1, Draw = 0, Loss = -1
2. Win = +1, Draw or Loss = 0
Can we say something about…
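To make the comparison concrete (the notation below is mine, not the exercise's): if a fixed policy wins, draws and loses with probabilities $p_w$, $p_d$ and $p_l$, the expected terminal reward under the two functions is

$$\mathbb{E}[R_1] = p_w - p_l, \qquad \mathbb{E}[R_2] = p_w ,$$

so the second reward function leaves the agent indifferent between drawing and losing, while the first strictly prefers a draw to a loss.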
I'm using Q-learning (off-policy TD control, as specified on p. 131 of Sutton's book) to train an agent to play connect four. My goal is to create a strong player (superhuman performance?) purely by self-play, without training models against other…
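For reference, the update the excerpt refers to (standard tabular Q-learning from Sutton & Barto); the dictionary-based table below is illustrative only, not the asker's setup.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 1.0       # step size; gamma = 1 is common for episodic games

def q_update(state, action, reward, next_state, next_legal_actions, done):
    """One Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    if done or not next_legal_actions:
        target = reward
    else:
        # In naive self-play, next_state is the opponent's turn; many connect-four
        # implementations therefore use reward - gamma * max(...) (a negamax view).
        target = reward + gamma * max(Q[(next_state, a)] for a in next_legal_actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```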
I am trying to use tensorflow / keras to play a text-based game. The game pits two players against each other; they play by answering questions, choosing an answer from among the proposed ones.
Game resembles this:
Questions asked from player 1, choose value {0, 1,…
I was reading the NFSP paper from D. Silver, and I'm somewhat confused by the algorithm:
In particular, given that we sample an action according to the best response ($\sigma = \epsilon\text{-greedy}(Q)$), we also insert this transition in…
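As a point of reference, my reading of the NFSP bookkeeping, written as a hedged sketch: with probability $\eta$ the agent acts by its best response $\epsilon\text{-greedy}(Q)$, and only then is the $(s, a)$ pair added to the supervised memory $\mathcal{M}_{SL}$, while every transition goes into the RL memory $\mathcal{M}_{RL}$. The `q_policy`, `avg_policy`, `env` and memory objects below are assumed interfaces, not the paper's code.

```python
import random

def nfsp_step(state, q_policy, avg_policy, env, m_rl, m_sl, eta=0.1):
    """One acting/storage step in the NFSP scheme as I read it (assumed interfaces)."""
    if random.random() < eta:                 # anticipatory parameter eta
        action = q_policy(state)              # best response: sigma = epsilon-greedy(Q)
        m_sl.append((state, action))          # (s, a) stored in the supervised memory
    else:
        action = avg_policy(state)            # otherwise act from the average policy Pi
    next_state, reward, done = env.step(action)
    m_rl.append((state, action, reward, next_state, done))   # always stored in M_RL
    return next_state, done
```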
I am trying to reproduce AlphaZero's algorithm on the board game Carcassonne. Since I want to use the final game score differences (i.e. victory point of player 1 - victory point of player 2) as the final and only reward, AlphaZero's UCB score can…
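For context, AlphaZero's selection rule is

$$a = \arg\max_a \left[ Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right],$$

which implicitly assumes $Q$ lies in a bounded range such as $[-1, 1]$. With raw victory-point differences as the reward, $Q$ is unbounded; the MuZero paper handles this by min-max normalising $Q$ over the values observed in the current search tree,

$$\bar{Q}(s,a) = \frac{Q(s,a) - \min_{(s',a') \in \text{tree}} Q(s',a')}{\max_{(s',a') \in \text{tree}} Q(s',a') - \min_{(s',a') \in \text{tree}} Q(s',a')},$$

which is one published way to keep the exploration term meaningful when returns are not in $[-1, 1]$.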
Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for…