
I implemented the Q-learning algorithm to play tic-tac-toe. The AI plays against another instance of the same algorithm, but they don't share the same Q matrix. After 200,000 games, I still beat the AI very easily, and it plays rather poorly. Move selection uses an epsilon-greedy policy.

What could cause the AI not to learn?


Edit

Here is how I do it (pseudo code):

for(int i = 0; i < 200000; ++i){
    //Game is restarted here
    ticTacToe.play();
}

And in my ticTacToe class I have a simple loop:

while(!isFinished()){
    swapPlaying(); // Change the players' turn
    Position toPlay = playing.whereToMove();
    applyPosition(toPlay);
    playing.update(toPlay);
}

// Here I just update my players according to whether they won, drew or lost.
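
For completeness, the end of a game then looks roughly like this (a sketch; notifyResult, playerX, playerO and getReward(player) are placeholder names, not my real methods):

while(!isFinished()){
    // ... the loop above ...
}
// Final win/draw/loss update once the game is over.
playerX.notifyResult(getReward(playerX)); // +1 if it won, -1 if it lost, 0 on a draw
playerO.notifyResult(getReward(playerO));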

In my players, I select the move with epsilon-greedy, implemented as below:

Moves moves = getMoves(); // Return every move available
Qvalues qValues = getQValues(moves); // return only qvalues of interest
//also create the state and add it to the Q-matrix if not already in.

if(!optimal) {
    updateEpsilon(); // I update epsilon with a simple schedule: epsilon = 1/k, with k being the number of games played
    double r = (double) rand() / RAND_MAX; // Random between 0 and 1
    if(r < epsilon) { // Exploration
        return randomMove(moves); // Selection of a random move among every move available
    } else {
        return moveWithMaxQValue(qValues);
    }
} else { // If I'm not in the training part anymore
    return moveWithMaxQValue(qValues);
}

And I update with the following:

double reward = getReward(); // Return 1 if game won, -1 if game lost, 0 otherwise
double thisQ, maxQ, newQ;
Grid prevGrid = Grid(*grid); // I have a shared_ptr on the grid for simplicity
prevGrid.removeAt(position); // We remove the action executed before

string state = stateToString(prevGrid);
thisQ = qTable[state][action];
maxQ = maxQValues();

newQ = thisQ + alpha * (reward + gamma * maxQ - thisQ);
qTable[state][action] = newQ;
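
Written out as plain C++ with an std::unordered_map as the Q-table, the same update rule looks roughly like this (just a sketch, not my actual classes; it needs C++17 for try_emplace):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Q-table: serialized board state -> one Q-value per cell (actions 0..8).
using QTable = std::unordered_map<std::string, std::vector<double>>;

// One tabular Q-learning update:
// Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
void updateQ(QTable &qTable,
             const std::string &state, int action,
             const std::string &nextState,
             double reward, double alpha, double gamma)
{
    auto &row     = qTable.try_emplace(state, 9, 0.0).first->second;
    auto &nextRow = qTable.try_emplace(nextState, 9, 0.0).first->second;

    double thisQ = row[action];
    // In the real code the max should only run over moves that are
    // still legal in nextState; here every cell starts at 0.
    double maxQ = *std::max_element(nextRow.begin(), nextRow.end());

    row[action] = thisQ + alpha * (reward + gamma * maxQ - thisQ);
}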

As mentioned above, both AIs use the same algorithm, but they are two distinct instances, so they don't share the same Q-matrix.

I read somewhere on Stack Overflow that I should take into account the move of the opposing player, but since I update a state after both the player's move and the opponent's move, I don't think it's necessary.
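
To be explicit about what that suggestion means (as I understand it): each player would only finish the update for its previous move once it is about to move again, or once the game ends, so that the "next state" it uses already contains the opponent's reply. A sketch of that bookkeeping, reusing the QTable/updateQ sketch above (the names here are illustrative, not my current code):

#include <string>

// Per-player bookkeeping: remember the last (state, action) pair and
// settle its update only once the opponent has replied (or the game ended).
struct PendingUpdate {
    std::string state;   // the board as this player saw it before its last move
    int action = -1;     // the move it played there
    bool valid = false;
};

// Called at the start of this player's turn, or at game end with the final reward.
// nextState is the current board, i.e. it already includes the opponent's reply.
void settlePending(PendingUpdate &pending, const std::string &nextState,
                   double reward, QTable &qTable, double alpha, double gamma)
{
    if (!pending.valid) return;
    updateQ(qTable, pending.state, pending.action, nextState, reward, alpha, gamma);
    pending.valid = false;
}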

Irindul

1 Answer


I can't add a comment because of my low reputation, but you can check this question about the size of the tic-tac-toe state space: https://math.stackexchange.com/questions/485752/tictactoe-state-space-choose-calculation
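
If it helps as a sanity check, the naive upper bound on the number of board strings is 3^9 = 19683, and the set of positions actually reachable through legal play is a few thousand. The small stand-alone program below enumerates them (it assumes 0 = empty, 1 = X, 2 = O, X moves first, and the game stops once someone has won), so you can compare the count with the size of your Q-matrix after training:

#include <array>
#include <iostream>
#include <set>

using Board = std::array<int, 9>;

// True if player p (1 or 2) has three in a row on board b.
bool hasWon(const Board &b, int p) {
    static const int lines[8][3] = {{0,1,2},{3,4,5},{6,7,8},
                                    {0,3,6},{1,4,7},{2,5,8},
                                    {0,4,8},{2,4,6}};
    for (auto &l : lines)
        if (b[l[0]] == p && b[l[1]] == p && b[l[2]] == p) return true;
    return false;
}

// Depth-first enumeration of every position reachable through legal play.
void explore(const Board &b, int player, std::set<Board> &seen) {
    if (!seen.insert(b).second) return;        // already counted
    if (hasWon(b, 1) || hasWon(b, 2)) return;  // terminal: no further moves
    for (int i = 0; i < 9; ++i) {
        if (b[i] != 0) continue;
        Board next = b;
        next[i] = player;
        explore(next, player == 1 ? 2 : 1, seen);
    }
}

int main() {
    std::set<Board> seen;
    explore(Board{}, 1, seen);
    std::cout << "Reachable positions: " << seen.size() << "\n";
}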

Molnár István