What is the optimal score for Tic Tac Toe for a reinforcement learning agent against a random opponent?

Question

I guess this problem is encountered by everyone trying to solve Tic Tac Toe with various flavors of reinforcement learning.

The answer is not "always win" because the random opponent may sometimes be able to draw the game. So it is slightly less than the always-win score.

I wrote a little Python program to calculate that. Please help verify its correctness and inform me if it has bugs or errors.

@Taw Yes, currently. I may make it an option in the program. — Yan King Yin, Sep 04 '21 at 08:26

score 3 · Answer 1 · answered Sep 04 '21 at 07:20

3

This blog post suggests that when playing against a random opponent, if the agent goes first, the win rate is 97.8%, and if they go second, the win rate is 79.6% (and the rest are draws).

answered Sep 04 '21 at 07:20

Taw

1,321
5
12

Yan King Yin · Answer 2 · 2021-09-05T09:56:05.560

Here I asssume:

Both players avoid illegal moves perfectly
Player X always chooses the move with maximum expectation value
Player O chooses all available moves with equal probability

Result depends on the scoring scheme:

(This scheme is used in one version of Gym Tic Tac Toe)

For win=20, draw=10, lose=-20:

Optimal expectation value =
- X plays first: 19.94791666666666...
- O plays first: 19.164021164021...
For win=20, draw=0, lose=-20:

Optimal expectation value =
- X plays first: 19.89583333333333...
- O plays first: 18.497354497354....

It also helps to verify the program with some pre-played board positions, included in the code.

Here is the program:

import math     # for math.inf = infinity
print("Calculate optimal expectation value of TicTacToe")
print("from the perspective of 'X' = first player.")
print("Assume both players perfectly avoid illegal moves.")
print("Player 'X' always chooses the move with maximum expectation value.")
print("Player 'O' always plays all available moves with equal probability.")
print("You may modify the initial board position in the code.")
Empty board
test_board = 9 * [0]
Pre-moves, if any are desired:
X|O|
O|O|X
X| |
#test_board[0] = -1
#test_board[3] = 1
#test_board[6] = -1
#test_board[4] = 1
#test_board[5] = -1
#test_board[1] = 1
def show_board(board):
    for i in [0, 3, 6]:
        for j in range(3):
            x = board[i + j]
            if x == -1:
                c = '❌'
            elif x == 1:
                c = '⭕'
            else:
                c = '  '
            print(c, end='')
        print(end='\n')
if test_board != 9 * [0]:
    print("\nInitial board position:")
    show_board(test_board)
**** Calculate expectation value of input board position
def expectation(board, player):
if player == -1:
    # **** Find all possible next moves for player 'X'
    moves = possible_moves(board)

    # Calculate expectation of these moves;
    # Player 'X' will only choose the one of maximum value.
    max_v = - math.inf
    for m in moves:
        new_board = board.copy()
        new_board[m] = -1       # Player 'X'

        # If this an ending move?
        r = game_over(new_board, -1)
        if r is not None:
            if r &gt; max_v:
                max_v = r
        else:
            v = expectation(new_board, 1)
            if v &gt; max_v:
                max_v = v
    # show_board(board)
    print(&quot;X's turn.  Expectation w.r.t. Player X =&quot;, max_v, end='\r')
    return max_v

elif player == 1:
    # **** Find all possible next moves for player 'O'
    moves = possible_moves(board)
    # These moves have equal probability
    # print(board, moves)
    p = 1.0 / len(moves)

    # Calculate expectation of these moves;
    # Player 'O' chooses one of them randomly.
    Rx = 0.0        # reward from the perspective of 'X'
    for m in moves:
        new_board = board.copy()
        new_board[m] = 1        # Player 'O'

        # If this an ending move?
        r = game_over(new_board, 1)
        if r is not None:
            if r == 10:             # draw is +10 for either player
                Rx += r * p
            else:
                Rx += - r * p       # sign of reward is reversed
        else:
            v = expectation(new_board, -1)
            Rx += v * p
    # show_board(board)
    print(&quot;O's turn.  Expectation w.r.t. Player X =&quot;, Rx, end='\r')
    return Rx


def possible_moves(board):
    moves = []
    for i in range(9):
        if board[i] == 0:
            moves.append(i)
    return moves
Check only for the given player.
Return reward w.r.t. the specific player.
def game_over(board, player):
    # check horizontal
    for i in [0, 3, 6]:     # for each row
        if board[i + 0] == player and 

           board[i + 1] == player and 

           board[i + 2] == player:
            return 20
# check vertical
for j in [0, 1, 2]:     # for each column
    if board[3 * 0 + j] == player and \
       board[3 * 1 + j] == player and \
       board[3 * 2 + j] == player:
        return 20

# check diagonal
if board[0 + 0] == player and \
   board[3 * 1 + 1] == player and \
   board[3 * 2 + 2] == player:
    return 20

# check backward diagonal
if board[3 * 0 + 2] == player and \
   board[3 * 1 + 1] == player and \
   board[3 * 2 + 0] == player:
    return 20

# return None if game still open
for i in [0, 3, 6]:
    for j in [0, 1, 2]:
        if board[i + j] == 0:
            return None

# For one version of gym TicTacToe, draw = 10 regardless of player;
# Another way is to assign draw = 0.
return 10


print("\u001b[2K\nOptimal value =", expectation(test_board, -1) )

Example output (for X's turn to play):

score 0 · Answer 3 · answered Jun 13 '23 at 09:23

Actually, you can directly calculate the optimal odds to win in TicTacToe against a random opponent and it turns out that you would win 191 out of 192 games and draw the last one when beginning.

For that -unintuitively- you have to start in a corner, lets say top left:

O . .
. . .
. . .

Then one can see that the only way to draw for the second player is to put the X in the middle (so chance for a random opponent: 1/8). If he doesn't, O can force a win, e.g. after

O X . 
. . .
. . .

O can put the marker in the middle, forcing X on the bottom right and then putting an O on the middle left, creating 2 possibilities for a tic-tac-toe.

One can easily also check the other options for X to go wrong and will confirm that only an X in the middle will hold a draw, but to keep the answer brief I will not proof it here.

Then, we can put our next "O" in the top middle, leading to

O O .
. X .
. . .

where the second player needs to put an "X" in the top right (chance 1/6 for a random opponent)

Then we place our "O" in the bottom left to simultaneously deny "X"'s thread and make a thread on our own:

O O X
. X .
O . .

So at this point a random opponent has a 1/4 chance to save the game. After that, we have to play middle right to draw, as we have no more chance to get any Tic-Tac-Toe even with cooperative play from X.

The chance to win is therefore 1 - 1/8*1/6*1/4 = 191/192.

Remark: One can easily see that this is the best strategy for a win. The total number of moves for "X" are 384 (8 on turn one, then 6, then 4, then 2), out of which with this strategy only 2 ways draw. When "O" starts in the middle, with perfect play all 4 corners can draw for "X". Analogously one can justify every subsequent decision for "O".

What is the optimal score for Tic Tac Toe for a reinforcement learning agent against a random opponent?

3 Answers3

Empty board

Pre-moves, if any are desired:

X|O|

O|O|X

X| |

**** Calculate expectation value of input board position

Check only for the given player.

Return reward w.r.t. the specific player.

Linked