
I implemented the tabular Q-learning algorithm for 3x3 tic-tac-toe multiple times, and every time the number of entries in the Q-table is 16,167. I want to know how to arrive at the number 16,167 - what is the formula to calculate it? The Q-learning algorithm also uses all symmetries: 0, 90, 180 and 270 degree rotations, plus horizontal, vertical, diagonal and anti-diagonal reflections.

I know that a 3x3 tic-tac-toe board can hold any combination of three states (empty, X, O) across its 9 positions, which means it has $3^9 = 19,683$ possible configurations.

Can anybody please help me find the formula to calculate it?

1st EDIT: An example entry of my Q-table:

('_XXX_O_OO', 6): 0.9024999999999993

2nd EDIT:

"('_________', 2)": 0.06166253080818989,
"('_________', 8)": 0.06166253080818989,
"('_________', 0)": 0.06166253080818989,
"('_________', 6)": 0.06166253080818989,

"('_________', 3)": 0.030324239570976224, "('_________', 7)": 0.030324239570976224, "('_________', 5)": 0.030324239570976224, "('_________', 1)": 0.030324239570976224,

"('_________', 4)": 0.15534933785554575,

I save the board and all of its symmetries in the Q-table.

3rd EDIT:

my update_q_table function:

def update_q_table(
    board,
    action,
    reward,
    new_board,
    alpha,
    gamma,
    future_value,
):
    global Q_TABLE_DICT
    symmetric_states_and_actions = generate_symmetric_states_and_actions(board, action)
    old_value = Q_TABLE_DICT.get((board, action), 0)

    if game_over(new_board):
        # If it's a terminal state, there are no future rewards to consider
        next_max = 0
    else:
        next_max = future_value

    temporal_difference = reward + (gamma * next_max) - old_value
    new_q_value = old_value + (alpha * temporal_difference)
    # print(new_q_value)

    # Update the Q-value for the current state-action pair and its symmetric states
    for symmetric_state, symmetric_action in symmetric_states_and_actions:
        Q_TABLE_DICT[(symmetric_state, symmetric_action)] = new_q_value

    return new_q_value

Hans123

2 Answers


The bad news is that there is no simple formula for calculating the state space or state+action space here. It is relatively easy to get upper or lower bounds for it, but complex to calculate precisely. A maths formula for it would end up looking like pseudocode written in maths notation: a procedure, rather than a simple expression.

Your calculation of $3^9$ is an overestimate of the state space (i.e. ignoring actions), based on the full gamut of the state representation. It's an overestimate because it includes many impossible, unreachable states such as XX_______ or ____O____, where the turn sequence has not been followed. You could extend this representation-space analysis though, by multiplying by all 9 actions to arrive at $9 \times 3^9 = 177147$. This is an upper bound on the size of your Q table. As well as impossible states, it will include invalid state and action combinations such as (____X____, 4), where the action targets an already occupied cell.

Another way to arrive at a strict upper bound is to count all the nodes in the game decision tree. At the start, player X must make one of 9 decisions, then player O must take one of the remaining 8, and so on. This leads to the upper bound of $\sum_{t=0}^8 \prod_{c=9-t}^9 c$ (where the inner product multiplies the number of choices available at each time step up to step $t$). This resolves to $986409$ - higher than the number of representations! That's because many of the move sequences end up in the same state - action sequence $(4,5,0)$ produces the same board as $(0,5,4)$. This count will also include states where play continues after X or O has already won.
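That sum can be sanity-checked in a couple of lines of Python (just an illustrative check, using math.perm from Python 3.8+):

import math

# Nodes in the decision tree = move sequences of length 1 to 9
print(sum(math.perm(9, k) for k in range(1, 10)))  # 986409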

You can refine the calculation by filtering, but there is no simple formula for that, because you have to apply the game rules, including turn-taking and winning conditions, to determine whether a state and action pair is valid. You will need to script something that generates one of the larger upper-bound sets, then filters it logically.

I would suggest that the easiest calculation here is to start with all possible state representations and filter as follows:

  • Remove all states where count of X is either less than count of O, or is more than $1$ higher.
  • Remove all states which show either X or O (or both!) completing a line, because no action is possible from those states, so they are never added to the Q table.

Then, for each remaining state, count the number of _ cells (the available actions), and sum these counts to get the number of (state, action) pairs, i.e. the expected size of a completed Q table. Note this correctly assigns a count of $0$ to the full boards of drawn games, so they won't appear in the Q table (it doesn't matter whether you also filter these out explicitly).

I have tested this with a short script and obtained the number $16167$. So you can have some confidence that your learning algorithm is exploring all options.
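A minimal sketch of such a counting script could look like this (assuming the same 9-character board strings with _, X and O as in the question; it is an illustration of the filtering above, not necessarily the exact script):

from itertools import product

# The 8 winning lines of the 3x3 board, as index triples
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def has_line(board, player):
    return any(all(board[i] == player for i in line) for line in LINES)

total = 0
for cells in product("_XO", repeat=9):
    board = "".join(cells)
    x, o = board.count("X"), board.count("O")
    # Turn-taking: X moves first, so the X count equals the O count or is one higher
    if not (o <= x <= o + 1):
        continue
    # No action is taken from a finished game, so winning boards never enter the Q table
    if has_line(board, "X") or has_line(board, "O"):
        continue
    # Every empty cell is one legal (state, action) pair; full drawn boards add 0
    total += board.count("_")

print(total)  # 16167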

Neil Slater

Great answer from @Neil Slater; however, I wanted to give my 2 cents.

Even though it's called "tabular" Q-learning, that doesn't imply you have to use a literal table to store it; yes, a table might have some advantages, but there are other data structures out there.

In particular, you can use a map instead (a key-value data structure, the equivalent of a dictionary in Python), where the key is some string representation of your board and the value is an array of Q-values for the N possible moves.
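A rough sketch of that layout (the q_values helper and the zero-initialised NumPy arrays are just illustrative choices):

import numpy as np

Q = {}

def q_values(board):
    # Lazily create an array of 9 action values for a board seen for the first time
    if board not in Q:
        Q[board] = np.zeros(9)
    return Q[board]

# Usage: read and update the value of the centre move (action 4) on the empty board
values = q_values("_________")
values[4] = 0.155
best_action = int(np.argmax(values))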

Now the question is: do we really want to save that little memory, paying the additional cost of a map, or can we live with a sparse table containing entries that are never used? I'd say the latter, as it's much easier to vectorize accesses and updates of such a table.
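For instance, one sparse-but-vectorizable layout could be a dense array indexed by a base-3 encoding of the board (again only a sketch; the encode helper is illustrative):

import numpy as np

# One row per raw board representation (3^9 of them), one column per action;
# most rows are simply never touched during learning.
Q = np.zeros((3 ** 9, 9))

def encode(board):
    # Read the board string as a base-3 number: '_' -> 0, 'X' -> 1, 'O' -> 2
    digits = {"_": 0, "X": 1, "O": 2}
    index = 0
    for ch in board:
        index = index * 3 + digits[ch]
    return index

Q[encode("_________"), 4] = 0.155   # update the centre move of the empty board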

Furthermore, if memory is your concern, consider that tic-tac-toe has a ton of symmetries: you can flip the board horizontally or vertically, rotate it, and even swap the entries (X instead of O and vice versa), which will definitely save a lot of memory.
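As a sketch of how the board symmetries could be exploited, here is one possible set of index permutations for the 3x3 grid, with a hypothetical canonical() helper that maps all symmetric variants of a board string to a single representative (the X/O swap could be added as an extra transform):

# Index permutations for the 8 dihedral symmetries of the 3x3 board
SYMMETRIES = [
    (0, 1, 2, 3, 4, 5, 6, 7, 8),  # identity
    (6, 3, 0, 7, 4, 1, 8, 5, 2),  # rotate 90 degrees clockwise
    (8, 7, 6, 5, 4, 3, 2, 1, 0),  # rotate 180 degrees
    (2, 5, 8, 1, 4, 7, 0, 3, 6),  # rotate 270 degrees clockwise
    (2, 1, 0, 5, 4, 3, 8, 7, 6),  # flip horizontally
    (6, 7, 8, 3, 4, 5, 0, 1, 2),  # flip vertically
    (0, 3, 6, 1, 4, 7, 2, 5, 8),  # reflect over the main diagonal
    (8, 5, 2, 7, 4, 1, 6, 3, 0),  # reflect over the anti-diagonal
]

def canonical(board):
    # Pick the lexicographically smallest of the 8 symmetric variants
    return min("".join(board[i] for i in perm) for perm in SYMMETRIES)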

Alberto