Juari
From PokerAI
Juari is pokerbot project developed by Nimar Arora which is based on opponent modeling and action selection in poker. The below article is reprinted from the author's home page. On the website, there is also presentation on the bot.
Contents |
Overview
"Juari" (gambler in hindi) is a project aimed at modeling opponents in a game of poker, predicting the opponent hand, and determining the appropriate action to take in a poker game to maximize the expected payoff.
Poker is an interesting game to study since it's a game of chance and of imperfect information unlike backgammon, in which both players have perfect information, or chess, where there is perfect information and no element of chance either. Games such as poker present similar challenges that an intelligent agent in the real world would have to face. As such, these games are of great interest, and it appears that so far machines haven't been able to do better than humans.
In this project, I am taking on a variant of poker known as Texas-hold'em. In this variant each player is initially dealt 2 hole cards which are only known to them. At this point, each player can make his first round bets or "fold." Then three "community" cards are dealt on the table, face up, known as the flop. The players do another round of betting, and then one more community card is dealt known as the "turn". After the turn betting, one final community cards is dealt, known as the "river." The players get to bet once more, and all the players still in the game have to display their cards in the "showdown." During the betting, each player can either raise the bet, match the current bet by calling, or can fold -- i.e. drop out of the game. A player can re-raise a raise, but once all players have called the betting round ends. This a limit variant of the game, which means that there can be at most 4 raises in each round. The size of the bet is fixed at around $10 in the first two rounds and $20 in the last two rounds. One final aspect of the game is known as the "blind." In each round one player has to bet blind, i.e. before seeing his hole cards and another player has to make half a bet before seeing his cards. These players are known as the big-blind and the small-blind respectively. The big blind and the small blind position rotates around the table so that every player gets to be the big blind and the small blind an equal number of times.
Prior Work
Poker has been studied extensively in the game theory community. It has been shown by John Nash that such games admit an equilibrium strategy such that no player can do any better by deviating from his equilibrium strategy. Further, since this is a zero sum game, and symmetrical in all the players, the equilibrium strategy would guarantee each player a zero payoff. Although an equilibrium strategy is optimal against an optimal opponent, there is no guarantee that such a strategy is optimal against sub-optimal strategy. Accordingly, there are two main approaches to poker. One involves finding the optimal strategy and playing it irrespective of the opponent. The other involves modeling the opponent and playing the perfect strategy to exploit the knowledge gleaned from the opponent model.
In order to find the equilibrium strategy for zero-sum two player games, it has been shown by Von Neuman that a min-max equation needs to be solved. This min-max equation can be represented as a linear program, and any linear programming techniques can be used to solve it. The main issue is in the actual representation of the strategy. Till the mid-nineties the best way to represent a randomized strategy was to consider a probability distribution over all possible deterministic strategies. This "normal form" of the payoff matrix is quite hard to deal with since it has an exponential number of rows and columns. It was later show by Koller et al. [1] that the normal form could be converted to the sequence form. In the sequence form, the number of rows and columns of the payoff matrix is linear in the size of the game tree, and hence, this becomes easier to solve. In poker, the size of the game tree is still quite large -- O(10^18). So, some amount of approximation is still needed. One approach known as PsiOpti [2], or psedu optimum, approximates the game of poker to a similar game with only O(10^7) game tree nodes and solves it.
In the opponent modeling approaches, the two main approaches are Bayesian reasoning [3] or expectimax based search [4]. In either approach, an opponent model is made and the current payoff odds are deduced. Then a formula is used to randomly select an action using the expected payoff. One popular approach is to consider the action probabilities proportional to the exponentiation of the expected payoff. Thus, even actions with negative payoff are considered with non-zero probability. This could in some sense be called "bluffing."
My Approach
One approach that I first attempted to play poker was to treat this as a reinforcement learning problem, and to solve it with temporal difference Q-learning. It turns out that this doesn't work very well, and after some analysis and finally figured out the following. Q-learning can be used to mainly learn a deterministic action selection policy. It doesn't help learn a randomized action selection as is needed in poker. Secondly, it works in a mainly static environment. In poker, on the other hand, the opponent is constantly adjust his play based on his model of your own play.
In order to remedy these defects with Q-learning I did do some further reading on the topic. In particular, I found Littman's mini-max Q-learning [5] quite interesting. This is the only known approach where Q-learning actually learns a randomized policy. The problem, however, was that this approach assumed that both players knew the complete game state at each step, and took their actions independently and simulatenously. In poker, this is not a valid assumption, and hence this approach can't be applied at all.
I finally settled on realizing that the best approach is really to model the opponent and to determine the odds of every possible hand that the opponent could be holding. Then one can determine the weighted odds of victory and play accordingly. My main approach then focused on opponent modeling and action selection.
Opponent Modeling
There are two main approaches for opponent modeling in poker -- the strategy class and the observation class [4]. In the strategy class, one tries to learn the opponent's strategy by deducing the odds of each action in every possible state of the opponent. Then by looking at the opponent actions one can predict the odds of the different states the opponent could be in -- the belief state. With the belief state one can deduce one's own odds of winning. The main problem with this approach is that it relies on the cards that an opponent reveals at showdown to deduce the opponent's strategy. However, most games in poker don't proceed to showdown. Thus the model is in general more accurate for the type of hands which do proceed to showdown.
The observation class of modeling, on the other hand, doesn't rely on the showdown cards. It instead tracks frequency of actions taken by the opponent and attempts to predict the opponent's actions based solely on the frequency of his actions. This is the approach used, for example, in [4] combined with an expecti-max search to predict the expected payoff.
My Opponent Modeling and Belief State Maintenance
My approach combines the observation class with the assumption that less frequent actions imply stronger hands. Any opponent modeling approach has two main components -- observation and prediction. In my approach, I observe and record the amount of money wagered by each player in each round of the game. Over many games, I can build up a histogram of bets made in each round of the game. For purposes of prediction, I convert the histogram of actions into a probability of occurrence of the oberved action -- which I call the "action probability." The action probability plugs into the prediction algorithm as described in the following.
In any state of the game, based on the visible community cards, a total order can be defined on all possible hole cards that the opponent could have. The total order is based on the odds of winning of the hole cards. Given some hole cards and the visible community cards, I simulate many iterations of the game, by randomly drawing from a deck all possible remaining cards, and counting the number of times that the given hole cards won. This can be repeated for all possible hole cards to get the desired total order. The total order over the hands can be used to compute a relative hand strength for each possible hand. Note that the relative hand strength of a hand changes as different community cards are made visible.
In order to generate a belief state over the opponent hands, I need to maintain a weight for each opponent hand. The above description shows my calculation of the action probability and the relative hand strength of each hand. All that remains to be done is to down-weight hands whose relative hand strength doesn't conform to the action probability observed for the opponent. Consider, the following example. Say, I have previously observed than some opponent makes two bets or higher in the pre-flop round only 30% of the time, and 3 bets or higher only 20% of the time. Now, if the player "calls" 2 bets, i.e. he is signalling that he wants to make only 2 bets but not 3 or more. In this case, I would assume that his hand is in the top 30% of all hands but not the top 20%. Thus all possible hands whose relative strength is outside the 70% - 80% band get downweighted. In the same example, if the opponent had instead raised and brought his bet level up to 2, then one has to assume that the opponent could have gone all the way up to 4 bets. In that case, only the hands below the 70% relative strength level would get down-weighted.
My opponent modeling and prediction proceeds as follows. For every board configuration, I compute the relative hand strength of every possible opponent hand. Then for every possible opponent action, I compute the action probability range for that action. With the action probability range, I down-weight all hands whose relative hand strength lies outside this range. By repeating the downweighting process on every action I compute a vector of weights. By normalizing the vector of weights, I would effectively have a "belief state." However, since I never really need the precise probability of being in a particular state, I never normalize the weights.
Expected Payoff and Action Selection
The belief vector of unnormalized weights can be used to directly compute my odds of winning. The previous computation had assigned a relative hand strength to each possible hand, including the current hand that I hold. Now, I compare the weights of all opponent hands which have a lower relative hand strength to all hands weights with a higher hand strength to compute my probability of winning. My expected winning is at least the current pot size times this probability of winning. Now different actions have different expected payoff because each action requires me to put in a different amount of money in the pot. Thus, I can compute the expected payoff for each action. Exponentiating the expected payoff of each action, and normalizing over all actions gives me odds for each action. A naive strategy is to select an action randomly using these odds. A more sophisticated strategy is to make actions deterministically in some situations where an action has an obvious advantage.
When choosing deterministic actions, it is important to not reveal too much of one's hand. For example, if one has a strong hand one could deterministically decide to raise the bet level. Now, this might reveal to the opponent the strength of one's hand, but if I have been raising with weak hands in the past then the oppponent doesn't really know if I have a strong hand or not. Thus it is safe to bet aggressively with strong hands as long as one bluffs often enough. Another obvious rule is to never fold if the odds of winning are more than 50%.
My action selection is thus a combination of rule based and probabilistic methods which use the belief state over the opponent hands and the current pot size to compute the expected payoff of each action.
Software
In order to test my strategy against state of the art poker players, I have developed a program which can connect to the University of Alberta's poker server and play against other poker playing programs written by various researches around the world. In particular, the ones developed by University of Alberta are known to be the best.
The attached program can be executed as "python main.py <user name> <password>." It connects to the poker server and starts playing. The program currently prints a lot of debugging output to explain its thinking. In particular the strongest opponent hand among the most likely opponent hands, the current odds of winning etc.
My experimentation has showed that the poker player does an extremely good job of predicting the opponent hand. It is not able to win against the state of the art poker players, but appears to be breaking even. This is actually much better than the dumb strategy of folding every hand.
Conclusions and Future Work
My poker opponent modeling and prediction algorithm is much simpler than any prior work. I use the simple assumtion that the odds of the opponent action is corelated to the opponent's relative hand strength. This gives a simple but fairly accurate opponent model. The fact that I'm not able to win a lot of money against the best poker programs probably says that my action selection algorithm needs further work. Also, I need to play the program in heads-up mode where the action selection is relatively simpler than playing ring poker (i.e. against more than one opponent).
References
- [1] Daphne Koller and Avi Pfeffer, "Generating and Solving Imperfect
Information Games."
- [2] D. Billings, N. Burch et. al., "Approximating Game-Theoretic Optimal
Strategies for Full-scale Poker."
- [3] Ann E. Nicholson, Kevin B. Korb, and Darren Boulton, "Using Bayesian
Decision Networks to Play Texas Hold'em Poker."
- [4] Terence Conrad Schauenber, "Opponent Modeling and Search in Poker."
- [5] Michael Littman, "Markov games as a framework for multi-agent
reinforcement learning."
Links
- Nimar Arora home page
