I have a neural network that I want to use to self-play Connect Four. The network receives the board state and provides an estimate of the state's value.

For each move, I would then pick the action with the highest estimated value, occasionally choosing one of the other moves for exploration.
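
Concretely, my move selection looks something like the sketch below; `value_net` and `apply_move` are hypothetical callables standing in for my actual network and rules engine.

```python
import random

EPSILON = 0.1  # exploration rate (placeholder value)

def select_move(state, legal_moves, value_net, apply_move):
    """Pick the move whose successor state the network values highest;
    with probability EPSILON, explore with a random move instead.
    value_net(board) -> float and apply_move(board, move) -> board are
    hypothetical stand-ins for my actual network and move generator."""
    if random.random() < EPSILON:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda m: value_net(apply_move(state, m)))
```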

I intend to use TD($\lambda$) to calculate the error for each state and backpropagate it through the network.
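
For reference, my understanding of the backward-view TD($\lambda$) update (following Sutton & Barto) is:

$$
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad
e_t = \gamma \lambda e_{t-1} + \nabla_\theta V(s_t), \qquad
\theta \leftarrow \theta + \alpha \delta_t e_t
$$

where $e_t$ is the eligibility trace and $\theta$ are the network weights.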

But I'm confused about when this update should actually occur. Do I store the estimate of each state as it is visited and calculate the error as soon as the next state is chosen?

Or do I store a history of all states and backpropagate only once the game ends in a win/loss/draw?
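
If the second option is right, I imagine something like the forward-view sketch below, where I compute a $\lambda$-return target for every stored state once the game is over. The list names are placeholders; `values` has one extra entry for the terminal state, which is 0.

```python
def lambda_returns(rewards, values, gamma=1.0, lam=0.8):
    """Compute the forward-view lambda-return target G_t for each visited
    state after the game ends. rewards[t] is the reward received after the
    move at step t (non-zero only at game end for Connect Four), and
    values[t] = V(s_t), with values[T] = 0 for the terminal state."""
    T = len(rewards)
    G = [0.0] * (T + 1)
    G[T] = values[T]  # terminal bootstrap (zero)
    for t in reversed(range(T)):
        G[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G[t + 1])
    return G[:T]  # training targets: the error at step t is G[t] - values[t]
```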

Overall, I'm not sure I understand when the update occurs, partly because I don't quite understand how to implement $\lambda$. If I were to backpropagate after every move, how would I even know the weight that $\lambda$ gives this time step before I know how long the game will last?
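
For the first option, my understanding is that the backward view sidesteps this: the eligibility trace decays by $\gamma\lambda$ every step, so I can update after every move without knowing the game length in advance. Here is a minimal sketch with a linear value function standing in for the network (all constants are placeholder values):

```python
import numpy as np

ALPHA, GAMMA, LAM = 0.01, 1.0, 0.8  # placeholder hyperparameters
N_FEATURES = 42                     # e.g. one feature per board cell (assumption)

theta = np.zeros(N_FEATURES)  # weights of a linear stand-in for the network
e = np.zeros(N_FEATURES)      # eligibility trace, reset to zero each game

def value(features):
    return theta @ features

def td_step(features_t, features_t1, reward, terminal):
    """One online TD(lambda) update, applied after every move."""
    global theta, e
    v_t1 = 0.0 if terminal else value(features_t1)
    delta = reward + GAMMA * v_t1 - value(features_t)  # TD error
    e[:] = GAMMA * LAM * e + features_t  # for linear V, the gradient is the features
    theta += ALPHA * delta * e           # credits all recently visited states
```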

When self-playing, is the error the difference from that side's last move? I.e., do I compare move 1 against move 3, move 2 against move 4, etc.?
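
To make this last question concrete, these are the two interpretations I'm weighing, assuming $V(s)$ is always from the perspective of the player to move, with $\gamma = 1$ and no intermediate rewards:

```python
def td_error_same_side(values, t):
    """Interpretation (a): compare a side's state against its own next
    state two plies later, i.e. move t vs. move t+2."""
    return values[t + 2] - values[t]

def td_error_negamax(values, t):
    """Interpretation (b): compare against the opponent's very next state
    with the sign flipped (negamax-style), i.e. move t vs. move t+1."""
    return -values[t + 1] - values[t]
```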
