Reinforcement learning with Blackjack

This is a pair assignment, which means that if you are working on a team with someone else, you and your partner should do your best to engage in the pair programming model. At any point in time, one of you is “driving,” i.e., actually using the keyboard and mouse. The other is actively engaged, following along, preventing bugs, and providing ideas.

You should not divide the work into individual pieces. In other words, you should not split the assignment and work on different portions. Likewise, you should not work on the assignment without both of you being present.

You should make sure that, over the course of the assignment, you each spend roughly the same amount of time “driving.” I will also ask you to turn in a form rating the work that your partner does. My recommendation is to take turns approximately every 15 minutes or so. Set a timer to help you remember.

If you are working in a pair, only one of you needs to submit your work via Moodle.

Overview

For this assignment, you’ll create an AI that learns, via reinforcement learning, how to play a variant of the game Blackjack. You’ll then compare the strategies it learns to the optimal ones.

Starter code

Here is a zip file with the starter code you’ll need. Unzip it somewhere where you can work on it.

Peeking Blackjack

We’ll be playing a variation of the game Blackjack that we’ll call “Peeking Blackjack,” based on a dramatically simplified version of the basic Blackjack game. Here’s an overview of the rules. Our version of Blackjack is a solo game, where your goal is to maximize your reward (i.e., your “score”).

  • Shuffle a deck of cards.
  • Begin taking cards one at a time and put them in your hand. In our case, every card simply has a number indicating how much it is worth. (We’ll have no face cards, or aces, or anything like that.)
  • Your goal is to get the highest total in your hand without going over a predetermined threshold. (In traditional Blackjack, that threshold is 21; here it is a parameter, so it can vary.)
  • If the total of the cards in your hand goes over the threshold, the game is over and you get no reward for the cards in your hand.
  • You can stop drawing cards at any time and quit.
  • When you quit — if you haven’t gone over the threshold — your reward is the total of the cards in your hand.
  • At any time, instead of taking a card, you can peek at the next card coming up in the deck. (That’s different from regular Blackjack.) Peeking tells you what the next card is, but every time you peek, you receive a negative reward.

Part 1: building the Markov Decision Process (MDP)

You will be creating an MDP to describe states, actions, and rewards in this game. More specifically, in the starter code that I provide, you will implement the transition and reward function of the Blackjack MDP inside succAndProbReward(). Given a state of the game and an action, this function will return a list of possible successor states, the probability of reaching each one, and the reward associated with each transition.

More details

The deck can contain an arbitrary collection of cards with different face values. At the start of the game, the deck contains the same number of cards of each face value; we call this number the “multiplicity.” For example, a standard deck of 52 cards would have face values \([1, 2, \ldots, 13]\) and multiplicity 4. You could also have a deck with face values \([1,5,20]\); if we used multiplicity 10 in this case, there would be 30 cards in total (10 each of 1s, 5s, and 20s). The deck is shuffled, meaning that each permutation of the cards is equally likely.
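
To make the bookkeeping concrete, here is a tiny sketch (not from the starter code) of the deck representation this implies: one count per face value.

    cardValues = [1, 5, 20]   # face values present in the deck
    multiplicity = 10         # number of copies of each face value
    deckCardCounts = tuple(multiplicity for _ in cardValues)   # (10, 10, 10)
    totalCards = sum(deckCardCounts)                           # 30 cards in total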

Here is a flowchart that shows how the game will run. Below it, there’s more description explaining how it all works.

[Flowchart (blackjack_rule.png): how a round of Peeking Blackjack proceeds]

The game occurs in a sequence of rounds. In each round, the player has three actions available:

  • \(a_\text{take}\) - Take the next card from the top of the deck.
  • \(a_\text{peek}\) - Peek at the next card on the top of the deck.
  • \(a_\text{quit}\) - Stop taking any more cards.

In this problem, your state \(s\) will be represented as a 3-element named tuple:

State(handTotal, nextCard, deckCardCounts)

where handTotal is the current total of the cards in your hand, nextCard identifies the face-down card on top of the deck (it is None unless your previous action was a peek; see below for exactly what it holds), and deckCardCounts contains the counts of each card value remaining in the deck.
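
The starter code presumably defines State as a named tuple along these lines (the field names are taken from the description above; check the starter code for the actual definition):

    from collections import namedtuple

    State = namedtuple('State', ['handTotal', 'nextCard', 'deckCardCounts'])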

As an example, assume the deck has card values \([1, 2, 3]\) with multiplicity 2, and the threshold is 4. Initially, the player has no cards, so her total is 0; this corresponds to state

State(handTotal=0, nextCard=None, deckCardCounts=(2,2,2))
  • For \(a_\text{take}\), the three possible successor states (each with equal probability of \(1/3\)) are:

      State(handTotal=1, nextCard=None, deckCardCounts=(1, 2, 2))
      State(handTotal=2, nextCard=None, deckCardCounts=(2, 1, 2))
      State(handTotal=3, nextCard=None, deckCardCounts=(2, 2, 1))
    

    The three successor states have equal probabilities because each face value has the same number of cards remaining in the deck. In other words, a random card still available in the deck is drawn, and its corresponding count in the deck is then decremented. succAndProbReward() is expected to return all three of the successor states shown above. Note that the reward associated with a state after a “take” action is always 0. Even though the agent now has a card in their hand for which they may receive a reward at the end of the game, the reward is not actually granted until the game ends (see the termination conditions below). (A rough sketch of how these successors might be enumerated in code appears after this list.)

  • For \(a_\text{peek}\), the three possible successor states are:

      State(handTotal=0, nextCard=0, deckCardCounts=(2, 2, 2))
      State(handTotal=0, nextCard=1, deckCardCounts=(2, 2, 2))
      State(handTotal=0, nextCard=2, deckCardCounts=(2, 2, 2))
    

    Note that the player is not allowed to peek twice in a row; if the action is a second consecutive peek, succAndProbReward() should return []. Additionally, the reward associated with the state after peeking is -peekCost. That is, the agent will receive an immediate reward of -peekCost for reaching any of these states.

    Things to remember about the states after taking \(a_\text{peek}\):

    • From State(handTotal=0, nextCard=0, deckCardCounts=(2, 2, 2)), taking a card will lead to the state State(handTotal=1, nextCard=None, deckCardCounts=(1, 2, 2)) deterministically (that is, with probability 1.0).
    • The nextCard element of the state tuple is not the face value of the card that will be drawn next, but the index into the deckCardCounts tuple of the card that will be drawn next. In other words, the second element will always be between 0 and len(deckCardCounts)-1, inclusive.
  • For \(a_\text{quit}\), the resulting state will be State(handTotal=0, nextCard=None, deckCardCounts=None). (Setting the deck to None signifies the end of the game.)
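
To make the \(a_\text{take}\) mechanics concrete, here is a rough sketch of how the successors above could be enumerated. This is an illustration only, not the required implementation: the function name and the cardValues parameter are made up for the example, and the bust and empty-deck bookkeeping described below is deliberately omitted.

    from collections import namedtuple

    State = namedtuple('State', ['handTotal', 'nextCard', 'deckCardCounts'])

    def takeSuccessorsSketch(state, cardValues):
        """Rough sketch: enumerate the 'take' successors when the player has NOT
        peeked, ignoring the bust / empty-deck rules described below."""
        totalCards = sum(state.deckCardCounts)
        results = []  # (successor, probability, reward) triples
        for i, count in enumerate(state.deckCardCounts):
            if count == 0:
                continue  # no copies of this face value remain in the deck
            newCounts = list(state.deckCardCounts)
            newCounts[i] -= 1
            successor = State(handTotal=state.handTotal + cardValues[i],
                              nextCard=None,
                              deckCardCounts=tuple(newCounts))
            # The probability is proportional to how many copies of this card
            # remain; the immediate reward for a take is 0.
            results.append((successor, count / totalCards, 0))
        return results

    # For the example state above, each successor has probability 2/6 = 1/3.
    print(takeSuccessorsSketch(State(0, None, (2, 2, 2)), cardValues=[1, 2, 3]))

Taking after a peek (when nextCard is not None) would instead lead to a single successor with probability 1.0, and a second peek in a row should return [], as noted above.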

The game continues until one of the following termination conditions becomes true:

  • The player chooses \(a_\text{quit}\), in which case the reward is the sum of the face values of the cards in hand.
  • The player chooses \(a_\text{take}\) and “goes bust”. This means that the sum of the face values of the cards in hand is strictly greater than the threshold specified at the start of the game. If this happens, the end-of-game reward is 0.
  • The deck runs out of cards, in which case it is as if the agent selects \(a_\text{quit}\): grant a reward equal to the sum of the face values of the cards in hand. But be careful: if you take the last card and go bust, the reward is 0, not the sum of the card values.

As another example with our deck of \([1,2,3]\) and multiplicity 1, let’s say the player’s current state is State(handTotal=3, nextCard=None, deckCardCounts=(1, 1, 0)), and the threshold remains 4.

  • For \(a_\text{quit}\), the successor state will be State(handTotal=3, nextCard=None, deckCardCounts=None).
  • For \(a_\text{take}\), the successor states are State(handTotal=3 + 1, nextCard=None, deckCardCounts=(0, 1, 0)) or State(handTotal=3 + 2, nextCard=None, deckCardCounts=None). Each has a probability of \(1/2\), since 2 cards remain in the deck. Note that in the second successor state, the deck is set to None to signify that the game ended with a bust. You should also set the deck to None if the deck runs out of cards.
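
The end-of-game bookkeeping in these examples could be factored into a small helper along the lines of the sketch below. The names are illustrative assumptions, not code from the starter files; threshold stands for the bust threshold given to the MDP.

    from collections import namedtuple

    State = namedtuple('State', ['handTotal', 'nextCard', 'deckCardCounts'])

    def settleAfterTake(newTotal, newCounts, threshold):
        """Illustrative only: decide the successor and reward after a card is taken."""
        if newTotal > threshold:
            # Bust: the game ends and the hand is worth nothing.
            return State(newTotal, None, None), 0
        if sum(newCounts) == 0:
            # The deck ran out: the game ends and the reward is the hand total.
            return State(newTotal, None, None), newTotal
        # Otherwise play continues; no reward is granted yet.
        return State(newTotal, None, tuple(newCounts)), 0

    # Matches the example above (threshold 4, hand total 3, deck (1, 1, 0)):
    print(settleAfterTake(3 + 2, [1, 0, 0], threshold=4))  # bust: deck None, reward 0
    print(settleAfterTake(3 + 1, [0, 1, 0], threshold=4))  # play continues, reward 0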

Your task

Implement the succAndProbReward(state, action) function of class BlackjackMDP, inside submission.py. It should return a list of PossibleResult tuples, where PossibleResult is a 3-element named tuple defined as:

PossibleResult(successor: State, probability: float, reward: int)
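
Once you have an implementation, you can sanity-check it by hand before running the grader. The snippet below is hypothetical: the BlackjackMDP constructor arguments, the action name 'Take', and importing State from submission are assumptions; check the starter code for the actual names and signatures.

    # Hypothetical sanity check; verify names and signatures against the starter code.
    from submission import BlackjackMDP, State

    mdp = BlackjackMDP(cardValues=[1, 2, 3], multiplicity=2, threshold=4, peekCost=1)
    start = State(handTotal=0, nextCard=None, deckCardCounts=(2, 2, 2))

    for result in mdp.succAndProbReward(start, 'Take'):
        print(result.successor, result.probability, result.reward)
    # Expect the three successors from the earlier example, each with
    # probability 1/3 and reward 0.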

You can test your code by running the following two sets of tests:

python grader.py mdp-basic
python grader.py mdp-values

The first set (mdp-basic) consists of unit tests: they check whether the result of succAndProbReward is correct for a variety of inputs. The second set (mdp-values) actually runs value iteration on the MDP (which requires your succAndProbReward code) and compares the results with the correct ones.

Again, the idea is that succAndProbReward takes a state and action as parameters, and it returns a list of all of the possibilities that could happen. Each of those possibilities contains a successor state (that’s \(s'\)), a probability of that happening, and an immediate reward associated with that possibility.

(If you experience a timeout when running the tests, you have likely coded something inefficiently; a correct implementation runs quickly.)

Part 2: Adding Q-learning

In Part 1, we haven’t done any learning yet. You’ve simply specified the transition probabilities and rewards of the MDP, and I’ve supplied value iteration code so that the optimal policy could be determined. But suppose you go into a casino, and no one tells you the rewards or the transitions. We will see how reinforcement learning allows you to play the game and learn a strategy, and you’ll measure how close your approach comes to the optimal one.

Implement Q-learning

You will first implement a generic Q-learning algorithm. Your code goes into the method incorporateFeedback, in the class QLearningAlgorithm which you’ll find in submission.py. QLearningAlgorithm keeps track of all of the numeric data needed to do the learning; incorporateFeedback will be called by the simulation code I’ve provided whenever an update needs to be made.

We are using approximate Q-learning, which means \(Q(s, a) = \sum_i w_i f_i (s, a)\), where in code, \(w\) is self.weights, \(f\) is the featureExtractor function, and \(Q(s,a)\) is self.getQ.
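
In other words, the Q-value of a state/action pair is a weighted sum over its active features. The provided getQ presumably computes something along the lines of the sketch below; this assumes featureExtractor returns a list of (featureKey, featureValue) pairs and that self.weights behaves like a dictionary defaulting to 0 (check the starter code for its actual conventions).

    # Sketch of a linear Q-value as a sparse dot product (a method of QLearningAlgorithm).
    def getQ(self, state, action):
        return sum(self.weights[key] * value
                   for key, value in self.featureExtractor(state, action))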

The simulation code that I’ve provided already chooses an action via a simple \(\varepsilon\)-greedy policy. Your job is to implement QLearningAlgorithm.incorporateFeedback(), which should take an \((s, a, r, s')\) tuple and update self.weights according to the standard Q-learning update.
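
For reference, the standard Q-learning update in this linear setting, on observing a single \((s, a, r, s')\) tuple, is

\[
w \leftarrow w - \eta \left( \hat{Q}(s, a) - \left( r + \gamma \max_{a'} \hat{Q}(s', a') \right) \right) f(s, a),
\]

where \(\eta\) is the step size, \(\gamma\) is the discount factor, and \(\hat{Q}\) is the current estimate from self.getQ. If \(s'\) is an end state with no available actions, treat the \(\max\) term as 0. (The step size and discount are presumably available on the QLearningAlgorithm object; check the starter code for their exact names.)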

In order to test your Q-learning code, we’ve got to pick a feature extractor. For this portion of the assignment, we’ll use the provided identityFeatureExtractor(state, action), which treats each state/action pair as its own feature: it returns value 1 for the feature corresponding to that particular pair (and, implicitly, 0 for every other feature). This is already provided, and you don’t need to do any work here, but I’m explaining it so you know what’s going on.
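
Based on that description, identityFeatureExtractor is essentially a one-liner along these lines (the exact return convention in the starter code may differ):

    def identityFeatureExtractor(state, action):
        # One indicator feature per (state, action) pair, with value 1; every
        # other feature is implicitly 0.
        return [((state, action), 1)]

With that feature extractor in place, you can run the following tests: a basic unit test of your update, plus one that runs on a small MDP and one that runs on a large one: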

python grader.py qlearn-basic
python grader.py qlearn-id-small
python grader.py qlearn-id-large

The small test will output each state/action pair so you can see the optimal choices (from the value iteration algorithm) and compare them with the results from Q-learning. Both the small and large tests will also show you the number of states where the two approaches choose the same action, as well as the average reward each approach earns over some additional simulations.

If your approach is working, Q-learning on the small problem should match the optimal policy almost everywhere, but it should not work well on the large one. (Why?)

Part 3: A better feature extractor

The trouble with the previous approach on the large MDP is that there are just too many possible states. This is exactly why approximate Q-learning exists: instead of one feature per state/action pair, we should use features that summarize and approximate the situation. To that end, you should code a new feature extractor named blackjackFeatureExtractor as described in the code comments. Once you have coded it, you should be able to test it via:

python grader.py qlearn-bj-small
python grader.py qlearn-bj-large
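
If it helps to see what summarizing features mean in general, here is a purely hypothetical extractor in the same spirit. It is not the feature set that blackjackFeatureExtractor is required to compute (those features are specified in the code comments); it only illustrates the idea of discarding the exact state in favor of coarser descriptors.

    def summarizingFeatureExtractorExample(state, action):
        """Hypothetical illustration only, NOT the assignment's required features."""
        features = [(('handTotal', state.handTotal, action), 1)]
        if state.deckCardCounts is not None:
            # 0/1 indicators for which face values are still present in the deck.
            presence = tuple(1 if count > 0 else 0 for count in state.deckCardCounts)
            features.append((('presence', presence, action), 1))
        return features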

Using your blackjackFeatureExtractor, you should be able to get an average reward that is quite close to the optimal value on the largeMDP. Interestingly, the learned policy is still not identical to the optimal one. (At least, it wasn’t for me.)

Grading

To get a grade of “meets expectations,” you should have:

  • all of the tests from the mdp-basic suite passing
  • the results from qlearn-id-small being roughly the same as those produced by value iteration
  • code that appears to be implemented correctly (i.e., you haven’t hardwired it for the tests)

To get a grade of “exemplary,” you should have accomplished everything for “meets expectations,” plus:

  • The qlearn-bj-large test gets an average reward that’s close to the value iteration average reward

Many thanks to Percy Liang, who initially created this assignment and granted me permission to use and modify it.