
Consider a single-player card game which shares many characteristics with "unprofessional" Blackjack (not played in a casino; see the second bullet point), i.e.:

  • You're playing against a dealer with fixed rules.
  • You have one card deck which is played completely through.
  • etc. An exact description of the game isn't needed for my question, so I'll leave it at these simple bullet points.

The second point in particular has an important implication: the more cards that have been seen, the better your odds of predicting the next card - up to 100% certainty for the last card. Obviously, this property allows for precise exploitation of the game.
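To make that concrete, here is a minimal Python sketch (not the real game, just the deck composition listed further down) of how the probability of the next card sharpens as cards are seen:

    # Probability of each value being the next card, given what is left in the deck.
    def next_card_probabilities(remaining_counts):
        """remaining_counts: dict mapping card value -> number still in the deck."""
        total = sum(remaining_counts.values())
        return {value: count / total
                for value, count in remaining_counts.items() if count > 0}

    # Full deck as described below: A-9 four times each, fourteen 10s, two Js.
    full_deck = {"A": 4, "2": 4, "3": 4, "4": 4, "5": 4, "6": 4, "7": 4, "8": 4,
                 "9": 4, "10": 14, "J": 2}
    print(next_card_probabilities(full_deck))   # e.g. P(10) = 14/52 ≈ 0.269 at the start
    print(next_card_probabilities({"7": 1}))    # only one card left -> probability 1.0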

As far as the action and state space are concerned: the action space is discrete; the player only has a fixed number of actions (five in this case - since I haven't explained the rules, I won't go in-depth about this). Far more important is the state space. In my case, I decided to structure it as follows:

Card:   A   2   3   4   5   6   7   8   9   10   J
Count:  4   4   4   4   4   4   4   4   4   14   2

The first part of my state space describes how many cards of each value are left in the deck, so on the first move all 52 cards are still in it. This part alone allows for about 7 million possible variations.

The second part of the state space describes various decks in the game (once again, without the rules it's hard to explain in detail). Basically, it is five integers ranging from 0 to 21, depending on previous actions. That is another ~200k distinct situations.

The third part consists of two details - some known cards and other information. They only account for a small factor, but still add a considerable amount of variation to my state space.

Thus a complete state might look something like this. The start setup would be 444444444142;00000;00. Another example from the middle of a game: 42431443081;1704520;013. Semicolons have been added for readability.
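For reference, a sketch of how I imagine packing the three parts above into one feature vector for a function approximator; the field names and sizes are assumptions based on my description, not the actual game rules:

    import numpy as np

    CARD_VALUES = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J"]

    def encode_state(remaining_counts, pile_values, extra_info):
        """remaining_counts: dict value -> cards left   (part one)
        pile_values: five integers in 0..21             (part two)
        extra_info: the two small known-card details    (part three)"""
        counts = np.array([remaining_counts[v] for v in CARD_VALUES], dtype=np.float32)
        piles = np.array(pile_values, dtype=np.float32)
        extra = np.array(extra_info, dtype=np.float32)
        return np.concatenate([counts, piles, extra])

    # The start state 444444444142;00000;00 from the example above:
    start = encode_state(
        {"A": 4, "2": 4, "3": 4, "4": 4, "5": 4, "6": 4, "7": 4, "8": 4, "9": 4,
         "10": 14, "J": 2},
        [0, 0, 0, 0, 0],
        [0, 0],
    )
    print(start.shape)  # (18,)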

So now the question arises: as far as I understand, my state space is definitely finite and discrete, but too big to be solved by SARSA, Q-learning, Monte Carlo or the like. How can I approach projects with a big state space without losing a huge chunk of predictability (which I fear might happen with DQN, DDPG or TD3)? Particularly because only one deck is used - and it is played all the way through - it seems like a more precise solution should be possible.

1 Answer


How can I approach projects with a big state space without losing a huge chunk of predictability (which I fear might happen with DQN, DDPG or TD3)?

You can influence this by choosing a combination of function approximator and engineered features that is a good match for predicting either the value functions or the policy function that the agent will need to produce.

It is hard to tell a priori how much work you would need in this regard. One approach is to simply normalise/scale your feature vector as it is, use a deep neural network with hardware acceleration, play as many training games as you have time for, and train heavily. This approach to RL has repeatedly produced new state-of-the-art results, e.g. AlphaZero and OpenAI's Dota 2 agent, and it appears to work provided the compute resources you can throw at the problem are large enough. As you imply, card-counted blackjack is a solved problem, so doing this may be within your reach on consumer hardware.
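A rough sketch of that "scale the features and train a deep network on them" idea, in PyTorch; the sizes are assumptions (18 inputs = 11 counts + 5 pile values + 2 extras, and 5 discrete actions, matching your description), and the per-feature maxima are just a first guess at a normalisation:

    import torch
    import torch.nn as nn

    N_FEATURES, N_ACTIONS = 18, 5

    # Scale every feature into roughly [0, 1]: counts up to 4/14/2, piles up to 21,
    # and the two extras assumed to be small 0/1-style flags.
    feature_scale = torch.tensor([4.0] * 9 + [14.0, 2.0] + [21.0] * 5 + [1.0] * 2)

    q_network = nn.Sequential(
        nn.Linear(N_FEATURES, 128),
        nn.ReLU(),
        nn.Linear(128, 128),
        nn.ReLU(),
        nn.Linear(128, N_ACTIONS),   # one estimated action value per action
    )

    def q_values(raw_state: torch.Tensor) -> torch.Tensor:
        """raw_state: unscaled feature vector(s) as encoded from the game state."""
        return q_network(raw_state / feature_scale)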

Some smart feature engineering may help though, by making the core problem easier for the agent. As this is a hobby project, what you do will depend on what you want the agent to learn. Is the purpose of your project to teach the agent how to card-count from scratch? Given that you have told the agent the count of remaining cards in the state, it does not appear so.

Next question: do you need the agent to learn to sum up all remaining cards in order to calculate the raw probability of drawing a card of each value from the cards so far unseen? That is currently what your main state feature requires. If you don't need the agent to learn that, you could help it by using the probability of each card value being drawn as a feature instead of the remaining count. This feature is likely to reduce the complexity of the value and policy functions, which will reduce compute time and may also improve accuracy.
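The transformation itself is tiny; a sketch of what that feature could look like (illustrative only):

    import numpy as np

    # Replace the raw remaining counts with the probability of each value being
    # drawn next, so the agent no longer has to learn to normalise by the number
    # of unseen cards itself.
    def counts_to_probabilities(remaining_counts):
        counts = np.asarray(remaining_counts, dtype=np.float32)
        return counts / counts.sum()

    print(counts_to_probabilities([4, 4, 4, 4, 4, 4, 4, 4, 4, 14, 2]))
    # -> [0.0769 ... 0.2692 0.0385] instead of [4 ... 14 2]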

With a game that can be solved analytically, you could take this up to the point of deriving the optimal policy directly and not using RL at all (i.e. the training target for your NN would be the correct action or the action values!). The question for your project is then this: what precisely are you hoping to demonstrate that your agent can learn? Or perhaps: what do you want to gain by applying RL to this problem?
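For completeness, a sketch of that non-RL extreme, assuming a hypothetical exact solver optimal_action(state) derived analytically from the game rules; the network would then be trained by plain supervised learning on the solver's output:

    import torch
    import torch.nn as nn

    def optimal_action(state: torch.Tensor) -> int:
        return 0  # placeholder: a real solver would enumerate the deck states

    policy_net = nn.Sequential(nn.Linear(18, 128), nn.ReLU(), nn.Linear(128, 5))
    loss_fn = nn.CrossEntropyLoss()
    optimiser = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

    def supervised_step(states: torch.Tensor) -> float:
        """states: a batch of encoded feature vectors, shape (batch, 18)."""
        targets = torch.tensor([optimal_action(s) for s in states])
        loss = loss_fn(policy_net(states), targets)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        return loss.item()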

Neil Slater