I have spent a few days trying to wrap my head around how and why neural networks are used to play chess.
Although I know very little about how the game of chess works, I can understand the following idea. Theoretically, we could make a "tree" that included every possible outcome of a chess game. Through knowledge provided by chess experts, we could identify how "favorable" certain parts of this tree are compared to other parts of the tree. We could also use this tree to "rank" optimal chess moves based on how the chess board appears in the current turn (e.g. which pieces you and your opponent have left and where these pieces are situated).
The problem is that this tree would be so enormous that it would be impossible to create, store, and "search" (e.g. with the minimax algorithm).
I understand that the tree could perhaps be built from data, limiting its size to scenarios that are likely to appear rather than all possible scenarios. For example, a player could spend the whole game aimlessly shifting their "Rook" back and forth - theoretically, this could occur, but no player in their right mind would ever do it. Thus, the tree could be constructed from actual data on millions of chess games. This could tell us, for example: based on historical data and given the current setup of the chess board, 21% of games were won when the immediate next move was Queen to "F5", versus only 3% when the immediate next move was Knight to "F5". I suppose that at each move, the data-based tree could be queried to rank each candidate next move by the proportion of "terminal nodes" that resulted in wins for that move, given the current chess board.
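As a rough sketch of what I mean by querying historical data, the snippet below counts wins per move for each position and ranks the moves by win rate. The records, the position label `pos_A`, and the function names are all made up for illustration; a real dataset would need a proper board encoding.

```python
from collections import defaultdict

# Hypothetical records: (position label, move played, did the mover win?)
games = [
    ("pos_A", "Qf5", True),
    ("pos_A", "Qf5", False),
    ("pos_A", "Qf5", True),
    ("pos_A", "Nf5", False),
]


def build_table(records):
    """Map position -> move -> [wins, games played]."""
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for position, move, won in records:
        stats = table[position][move]
        stats[0] += int(won)
        stats[1] += 1
    return table


def rank_moves(table, position):
    """Return (move, win rate) pairs for a position, best first."""
    # A position that never appeared in the data gives no ranking at all.
    if position not in table:
        return []
    rates = {move: wins / total for move, (wins, total) in table[position].items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


table = build_table(games)
print(rank_moves(table, "pos_A"))
print(rank_moves(table, "pos_never_seen"))
```

In this toy data the Queen move ranks first because it won two of its three games, and a position that is missing from the records returns an empty ranking.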
However, I still see two problems with this approach:
- We might run into a position that never occurred in the historical data, rendering the tree useless for that position.
- The tree might still be too large to store and query efficiently.
This is probably why neural networks are used to play chess - I tried to do some reading on this topic, but I can't seem to fully understand it. In this case, what exactly would the neural network use as a loss function? I don't see how the loss function could be continuous here, so how could gradient descent be applied to it?
Could someone please recommend some sources (e.g. YouTube videos, blogs, etc.) that show how a neural network can be used to play chess?

 
     
     
    