
I implemented the Q-learning algorithm to play tic-tac-toe. The AI plays against another instance of the same algorithm, but they don't share the same Q matrix. After 200,000 games, I still beat the AI very easily, and it plays rather poorly. Move selection uses an epsilon-greedy policy.

What could cause the AI not to learn?


Edit

Here is how I do it (pseudo code):

for(int i = 0; i < 200000; ++i){
    //Game is restarted here
    ticTacToe.play();
}

And in my ticTacToe class I have a simple loop:

while(!isFinished()){
    swapPlaying(); // Change the players' turn
    Position toPlay = playing.whereToMove();
    applyPosition(toPlay);
    playing.update(toPlay);
}

// Here I just update my players according to whether they won, drew or lost.
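
For completeness, the end of a game then looks roughly like this (a sketch; notifyResult, playerX, playerO and getReward(player) are placeholder names, not my real methods):

while(!isFinished()){
    // ... the loop above ...
}
// Final win/draw/loss update once the game is over.
playerX.notifyResult(getReward(playerX)); // +1 if it won, -1 if it lost, 0 on a draw
playerO.notifyResult(getReward(playerO));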

In my players, I select the move with epsilon-greedy, implemented as below:

Moves moves = getMoves(); // Return every move available
Qvalues qValues = getQValues(moves); // return only qvalues of interest
//also create the state and add it to the Q-matrix if not already in.

if(!optimal) {
    updateEpsilon(); // I update epsilon with a simple schedule: epsilon = 1/k, with k being the number of games played
    double r = (double) rand() / RAND_MAX; // Random between 0 and 1
    if(r < epsilon) { // Exploration
        return randomMove(moves); // Selection of a random move among every move available
    } else {
        return moveWithMaxQValue(qValues);
    }
} else { // If I'm not in the training part anymore
    return moveWithMaxQValue(qValues);
}

And I update with the following:

double reward = getReward(); // Return 1 if game won, -1 if game lost, 0 otherwise
double thisQ, maxQ, newQ;
Grid prevGrid = Grid(*grid); // I have a shared_ptr on the grid for simplicity
prevGrid.removeAt(position); // We remove the action executed before

string state = stateToString(prevGrid);
thisQ = qTable[state][action];
maxQ = maxQValues();

newQ = thisQ + alpha * (reward + gamma * maxQ - thisQ);
qTable[state][action] = newQ;
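
Written out as plain C++ with an std::unordered_map as the Q-table, the same update rule looks roughly like this (just a sketch, not my actual classes; it needs C++17 for try_emplace):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Q-table: serialized board state -> one Q-value per cell (actions 0..8).
using QTable = std::unordered_map<std::string, std::vector<double>>;

// One tabular Q-learning update:
// Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
void updateQ(QTable &qTable,
             const std::string &state, int action,
             const std::string &nextState,
             double reward, double alpha, double gamma)
{
    auto &row     = qTable.try_emplace(state, 9, 0.0).first->second;
    auto &nextRow = qTable.try_emplace(nextState, 9, 0.0).first->second;

    double thisQ = row[action];
    // In the real code the max should only run over moves that are
    // still legal in nextState; here every cell starts at 0.
    double maxQ = *std::max_element(nextRow.begin(), nextRow.end());

    row[action] = thisQ + alpha * (reward + gamma * maxQ - thisQ);
}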

As mentioned above, both AIs use the same algorithm, but they are two distinct instances, so they don't share the same Q-matrix.

I read somewhere on Stack Overflow that I should take into account the move of the opposing player, but since I update a state after both the player's move and the opponent's move, I don't think it's necessary.
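
To be explicit about what that suggestion means (as I understand it): each player would only finish the update for its previous move once it is about to move again, or once the game ends, so that the "next state" it uses already contains the opponent's reply. A sketch of that bookkeeping, reusing the QTable/updateQ sketch above (the names here are illustrative, not my current code):

#include <string>

// Per-player bookkeeping: remember the last (state, action) pair and
// settle its update only once the opponent has replied (or the game ended).
struct PendingUpdate {
    std::string state;   // the board as this player saw it before its last move
    int action = -1;     // the move it played there
    bool valid = false;
};

// Called at the start of this player's turn, or at game end with the final reward.
// nextState is the current board, i.e. it already includes the opponent's reply.
void settlePending(PendingUpdate &pending, const std::string &nextState,
                   double reward, QTable &qTable, double alpha, double gamma)
{
    if (!pending.valid) return;
    updateQ(qTable, pending.state, pending.action, nextState, reward, alpha, gamma);
    pending.valid = false;
}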

Irindul

1 Answer


I can't add a comment because of my low reputation, but you can check this question about the size of the tic-tac-toe state space: https://math.stackexchange.com/questions/485752/tictactoe-state-space-choose-calculation
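
If it helps as a sanity check, the naive upper bound on the number of board strings is 3^9 = 19683, and the set of positions actually reachable through legal play is a few thousand. The small stand-alone program below enumerates them (it assumes 0 = empty, 1 = X, 2 = O, X moves first, and the game stops once someone has won), so you can compare the count with the size of your Q-matrix after training:

#include <array>
#include <iostream>
#include <set>

using Board = std::array<int, 9>;

// True if player p (1 or 2) has three in a row on board b.
bool hasWon(const Board &b, int p) {
    static const int lines[8][3] = {{0,1,2},{3,4,5},{6,7,8},
                                    {0,3,6},{1,4,7},{2,5,8},
                                    {0,4,8},{2,4,6}};
    for (auto &l : lines)
        if (b[l[0]] == p && b[l[1]] == p && b[l[2]] == p) return true;
    return false;
}

// Depth-first enumeration of every position reachable through legal play.
void explore(const Board &b, int player, std::set<Board> &seen) {
    if (!seen.insert(b).second) return;        // already counted
    if (hasWon(b, 1) || hasWon(b, 2)) return;  // terminal: no further moves
    for (int i = 0; i < 9; ++i) {
        if (b[i] != 0) continue;
        Board next = b;
        next[i] = player;
        explore(next, player == 1 ? 2 : 1, seen);
    }
}

int main() {
    std::set<Board> seen;
    explore(Board{}, 1, seen);
    std::cout << "Reachable positions: " << seen.size() << "\n";
}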

Molnár István