
Summary

I have built a neural network model for reinforcement learning that is supposed to learn to play the card game Bridge. It must predict the Q-values of the player's actions, but it also needs to distinguish which actions are allowed.

Earlier, when I used a simple model with only 56 binary input features, it was able to learn the rules of the game. But when I tried to train a model whose input layer consists of 189 input features, it failed completely: it converged to predicting constant output values, regardless of the input.

I have tried multiple models, training parameters, and different sets of input data. I also checked my code multiple times to make sure I am not supplying crappy input to the model. Since everything seems to be correct and yet nothing works, I must be doing something fundamentally wrong. Can anybody tell me what it is?

The game

For those who know Bridge: I focus only on the trick-taking part of the game. I assume that the bidding phase is over before the AI agents step in, so their only decision is which card from the player's hand to play.

If you do not know Bridge: it is always played by four players in two teams of two. Each player holds 13 cards, and in each round everybody plays one card; the player who won the last trick leads and the three other players follow clockwise. The player with the highest card wins the trick. If possible, you must follow the suit of the first card played in the current round.
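To make the follow-suit rule concrete, here is a minimal sketch of how the set of legal cards could be computed. It assumes cards 0-51 are grouped by suit (13 consecutive indices per suit); that layout and the function name are my assumptions for illustration only.

def legal_cards(hand, led_suit=None):
    # hand: set of card indices 0-51; led_suit: 0-3, or None when leading
    if led_suit is None:
        return set(hand)  # leading the trick: any card may be played
    following = {c for c in hand if c // 13 == led_suit}
    # if void in the led suit, any card may be played
    return following if following else set(hand)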

So in essence the rules are simple, but strategies may be complex, considering that every player has 12 decisions to make in each game (13 rounds of play minus the last one, when you simply toss the last card from your hand). After all 13 tricks are played, the game result is decided based on how many tricks each team won.

The model

I have 4 NN agents that I am trying to train, one for each player. First, the agents must learn the game rules; at a later stage they should learn winning strategies. The output of each agent's model is a Q-value for each card in the deck of 52 cards. So I am training a regression model that predicts the Q-values of 52 actions.

For the first stage I generated training data that only represents the playing rules: the target output has negative Q-values for illegal cards and positive ones for those that are allowed. At this stage it is really a classification problem, but I need the model to be a regression model so it can predict Q-values later.

In the input data I substituted all irrelevant features with zeros, so the model only gets 52 binary features representing the player's hand, plus 3*4 features representing the suits of the cards played in the current trick (columns 170-181). (The suit of the first card would be sufficient to determine the legal actions, but I supply the suits of all cards played before the current player; there may be 1, 2 or 3 of them.)

Example input:
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Expected output:
[[-30. -30. -30. 20. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. 20. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. 20. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30.], [-30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. 20. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30. -30.]]
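For context, here is a rough sketch of how such a training pair could be assembled. The -30/+20 targets and the column layout (hand flags in columns 0-51, trick-suit one-hots in columns 170-181) mirror the description above; the helper name and the exact suit ordering are assumptions.

import numpy as np

def make_sample(hand, trick_suits, legal):
    # hand: card indices held; trick_suits: suits of the 0-3 cards already
    # played this trick; legal: indices of the cards allowed by the rules
    x = np.zeros(189, dtype=np.float32)
    x[list(hand)] = 1.0                    # binary hand features (cols 0-51)
    for i, suit in enumerate(trick_suits):
        x[170 + 4 * i + suit] = 1.0        # one-hot suit per played card (cols 170-181)
    y = np.full(52, -30.0, dtype=np.float32)  # illegal actions
    y[list(legal)] = 20.0                     # legal actions
    return x, y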

This is one of the models I tried:

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam

model = keras.Sequential([
    layers.InputLayer((189,)),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(360, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(52, activation='linear')
])

model.compile(
    optimizer=Adam(learning_rate=0.6),
    loss='mean_squared_error',
    metrics=['mae']
)

The result

Below are the predictions I am getting after training the model. Since I am using MSE as the loss function and the majority of the expected output Q-values are -30, it is easy to see why the model converged to returning values close to -27. But why it does not use any information from the input when making predictions is a mystery to me.

Predictions:
[[-27.64804 -27.620054 -27.642275 -27.61838 -27.66718 -27.614843 -27.613771 -27.668032 -27.650768 -27.469149 -27.312817 -27.155523 -27.011065 -27.687826 -27.66859 -27.66804 -27.680109 -27.67095 -27.695951 -27.6926 -27.689423 -27.696001 -27.488274 -27.363392 -27.213408 -27.084932 -27.707504 -27.746534 -27.71295 -27.766203 -27.750078 -27.73849 -27.766941 -27.756678 -27.76499 -27.588118 -27.441809 -27.27206 -27.153002 -27.804546 -27.811834 -27.803896 -27.812439 -27.82825 -27.807224 -27.810957 -27.801945 -27.833725 -27.655598 -27.488352 -27.386484 -27.215782], [-27.64804 -27.620054 -27.642275 -27.61838 -27.66718 -27.614843 -27.613771 -27.668032 -27.650768 -27.469149 -27.312817 -27.155523 -27.011065 -27.687826 -27.66859 -27.66804 -27.680109 -27.67095 -27.695951 -27.6926 -27.689423 -27.696001 -27.488274 -27.363392 -27.213408 -27.084932 -27.707504 -27.746534 -27.71295 -27.766203 -27.750078 -27.73849 -27.766941 -27.756678 -27.76499 -27.588118 -27.441809 -27.27206 -27.153002 -27.804546 -27.811834 -27.803896 -27.812439 -27.82825 -27.807224 -27.810957 -27.801945 -27.833725 -27.655598 -27.488352 -27.386484 -27.215782]]


1 Answer

All right! This is the issue:
Adam(learning_rate=0.6)

This learning rate is much too big. If I had more experience with training NNs, I probably would have spotted it by looking at the learning curve, because my loss skyrocketed in the first epoch. But then it converged quickly, so I thought the learning rate was fine.
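(As an aside: a quick way to inspect the learning curve is to plot the history object that Keras returns from fit. This is a generic sketch rather than my actual training code, and x_train/y_train are placeholder names.)

import matplotlib.pyplot as plt

history = model.fit(x_train, y_train, epochs=50, validation_split=0.1)

# an early spike in the loss curve is a classic sign of a too-large learning rate
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.legend()
plt.show()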

A rate of 0.01 worked perfectly, while 0.02 was still too big: it also gave me constant predictions.
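So in terms of the compile call above, the fix is simply:

model.compile(
    optimizer=Adam(learning_rate=0.01),
    loss='mean_squared_error',
    metrics=['mae']
)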

I am still a bit puzzled about how the learning rate really works for Adam, considering it is an adaptive optimizer. But I guess proficiency in applying specific optimizers comes with years of training :)
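For reference, writing out Adam's update makes the role of the learning rate clearer: the adaptive part normalizes each parameter's gradient to roughly unit scale, so the learning rate still directly sets the step size. A rough sketch (bias correction omitted for brevity):

import numpy as np

def adam_step(theta, grad, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    return theta - lr * m / (np.sqrt(v) + eps), m, v  # step magnitude ~ lr

With lr=0.6, each weight can move by roughly 0.6 per update, which is huge for a network like this and plausibly explains the collapse to constant predictions.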
