Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?
Above is an extract from Reinforcement Learning: An Introduction by Andrew Barto and Richard S. Sutton, and I wasn't quite sure about what the answer to the question would be, so thought of posting it here. The algorithm being referred to is the one for playing the game tic-tac-toe.
In my opinion, if the same algorithm plays both sides, it may end up assisting itself to win every time - and not really learn anything. What do you think?