Java Black Jack and Reinforcement Learning  


by Frederic Meyer, Logic Systems Laboratory, EPFL, 1998


Introduction

Blackjack, or twenty-one, is a card game in which the player attempts to beat the dealer by obtaining a sum of card values that is equal to or less than 21 and higher than the dealer's total. The probabilistic nature of the game makes it an interesting testbed for learning algorithms, though learning a good playing strategy is not an obvious problem. Supervised ("learning with a teacher") systems are not very useful here, since the target outputs for a given stage of the game are not known. Instead, the learning system has to explore different actions and develop a strategy by selectively retaining the actions that maximize the player's performance. We have explored the use of blackjack as a test bed for learning strategies in neural networks, specifically with reinforcement learning techniques [1].
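The scoring rule above (a hand total that must not exceed 21) can be sketched in Java. This is an illustrative sketch assuming the usual blackjack convention that an ace counts as 11 unless that would bust the hand; the class and method names are not taken from the applet's source.

```java
// Sketch of blackjack hand scoring. Assumption: an ace counts as 11
// unless that would push the total over 21, in which case it counts as 1.
public class HandValue {
    // Card ranks: 1 = ace, 2..10 face value, 11..13 = J/Q/K (count as 10).
    public static int value(int[] ranks) {
        int total = 0;
        int aces = 0;
        for (int r : ranks) {
            if (r == 1) { aces++; total += 11; }
            else total += Math.min(r, 10);
        }
        // Demote aces from 11 to 1 while the hand would be bust.
        while (total > 21 && aces > 0) { total -= 10; aces--; }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(value(new int[]{1, 10}));    // ace + ten -> 21
        System.out.println(value(new int[]{1, 5, 10})); // ace demoted -> 16
    }
}
```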

This Java applet implements a simplified version of the game of Black Jack. One or two players can play against the dealer (i.e., the casino), and either or both players can be controlled by the computer.
By default, the computer plays in a random manner. However, you may let it play against the dealer and learn to play Black Jack from experience. The learning algorithm it may use is SARSA, a reinforcement learning algorithm introduced by G. Rummery and M. Niranjan [2].
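A minimal tabular SARSA learner can be sketched as follows. This is an illustrative sketch, assuming the state is indexed by the player's current card sum and there are two actions (stand and hit); the names and state encoding are assumptions, not taken from the applet's source.

```java
// Minimal tabular SARSA sketch. Assumptions: states 0..21 index the
// player's card sum; action 0 = stand, action 1 = hit.
public class Sarsa {
    public static final double ALPHA = 0.1; // step-size parameter
    public static final double GAMMA = 1.0; // discount factor

    public double[][] q = new double[22][2]; // Q(s, a) table

    // One SARSA backup: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)),
    // where (s', a') are the next state and the action chosen in it.
    public void update(int s, int a, double r, int s2, int a2) {
        q[s][a] += ALPHA * (r + GAMMA * q[s2][a2] - q[s][a]);
    }
}
```

With all Q-values initialized to zero, a terminal win reward of 1.0 moves the visited entry toward the reward by a step of ALPHA.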

A complete introduction to reinforcement learning can be found in the new book by R. Sutton and A. Barto [3]. For further information on reinforcement learning and Black Jack playing, you may refer to the WWW page Learning to Play Black Jack with Artificial Neural Networks, also maintained by the LSL lab at EPFL.


Instructions

There are two basic options: play and learn. By default, the applet starts in learn mode. You may switch to play mode simply by pressing the PLAY button on the left of the applet.

Learn

1. In the first window, Learning, you may select the learning options: the number of episodes used to train the computer and the number of games per episode.
2. The Start Learning button starts training. You may Suspend Learning or Stop Learning at any time by pressing the corresponding button in the Learning window.
3. The Estimate Fct. window enables the user to modify the external reinforcement values the learner receives when it wins or loses.
4. The Alpha and Gamma constants are the step-size parameter and the discount factor in the basic SARSA update:

   Q(s,a) <- Q(s,a) + alpha * [ r + gamma * Q(s',a') - Q(s,a) ]

   where s' is the next state and a' is the action selected in it.

5. The Action Selection window lets the user choose between an epsilon-greedy mechanism (with the corresponding 0 < epsilon < 1 value) and a SoftMax action selection mechanism. Higher epsilon values mean more exploration.
6. The dealer normally uses a fixed strategy: keep hitting until reaching 17, then stand. However, you may experiment with other fixed strategies.
7. The Information window presents the percentage of games won and the current learned Q-values.
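The epsilon-greedy mechanism of step 5 can be sketched as follows: with probability epsilon a random action is chosen (exploration), otherwise the action with the highest Q-value is chosen (exploitation). This is an illustrative sketch; the method names are not taken from the applet's source.

```java
import java.util.Random;

// Sketch of epsilon-greedy action selection over a row of Q-values.
public class EpsilonGreedy {
    public static int select(double[] qValues, double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(qValues.length); // explore: random action
        }
        int best = 0;                           // exploit: greedy action
        for (int a = 1; a < qValues.length; a++) {
            if (qValues[a] > qValues[best]) best = a;
        }
        return best;
    }
}
```

With epsilon = 0 the choice is purely greedy; with epsilon = 1 it is purely random, which matches the applet's default random player.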

Play

1. In Play mode, the applet starts by default with the Black Jack window. However, if you want the computer to use a learned strategy, you have to select it in the Preferences window.
2. In the Preferences window, you can select whether player 1, player 2, or both are human or computer players. For a computer player, you may select whether it plays randomly or uses the current learned strategy. Whenever you change something here, you have to press Set to make your choice valid. Set also resets the counters in the Black Jack window.
3. When playing, select Hit or Stand as appropriate, and Deal to start a new game.
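The dealer's fixed strategy described earlier (keep hitting below 17) applies in play mode as well, and can be sketched as a simple loop. The `drawCard` supplier here is a hypothetical stand-in for the applet's deck logic.

```java
import java.util.function.IntSupplier;

// Sketch of the dealer's fixed strategy: hit while the total is below
// the threshold, then stand. drawCard is a hypothetical helper that
// stands in for the applet's actual card-dealing code.
public class Dealer {
    public static final int STAND_THRESHOLD = 17;

    public static int playOut(int total, IntSupplier drawCard) {
        while (total < STAND_THRESHOLD) {
            total += drawCard.getAsInt();
        }
        return total;
    }
}
```

Other fixed strategies (step 6 of the Learn options) amount to changing the threshold in this loop.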

Observations


Java sources

The following gzip'd tar file contains the original Java source code implemented by F. Meyer (frederic.meyer@studi.epfl.ch) during a semester project.


References

[1] A. Perez-Uribe and E. Sanchez, "Blackjack as a Test Bed for Learning Strategies in Neural Networks", Proceedings of the IEEE International Joint Conference on Neural Networks IJCNN'98 (to appear)

[2] G. Rummery and M. Niranjan, ``On-line Q-learning using connectionist systems,'' Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.

[3] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[4] B. Widrow, N. Gupta, and S. Maitra, ``Punish/Reward: Learning with a Critic in Adaptive Threshold Systems,'' IEEE Transactions on Systems, Man and Cybernetics, vol. 3, no. 5, pp. 455--465, 1973.

[*] Some Reinforcement Learning WWW Links