## Tuesday, 15 February 2011

### Regex Tennis

Game History Validation

This example from the Universe of Regular Expressions illustrates a substantial, real world application of the "AB alphabet" type, famous from countless introductions and tutorials. But this example didn't arise at work. It came up while we were relaxing on holiday one summer, and watching Wimbledon. Suddenly I started sketching state diagrams on the backs of Embo postcards...

Suppose during a tennis match we want to record more than just game and set scores; we'd like to record the detailed sequence of points won and lost in each game. So for example, if player A won a "love game" (in which player B failed to score at all), we might record the four points that she won in this format:
AAAA
If instead she lost just one point out of five, but went on to win, then the game will be represented by one of these:
BAAAA
ABAAA
AABAA
AAABA
depending upon whether she lost the first, second, third or fourth point (she can't have lost the final fifth point, since she did win the game). Similarly if she lost two points, well, those could be any two out of the first five points. Applying binomial coefficients, "five choose two" = ten possibilities, and the game's history will be one of these:
BBAAAA
BABAAA
BAABAA
BAAABA
ABBAAA
ABABAA
ABAABA
AABBAA
AABABA
AAABBA
Now suppose we wish to validate such game histories. Can we use a Regex to determine whether a given sequence of As and Bs represents a legal game of tennis? Well, the scoring in any sport can be represented by a finite state machine, so yes, a Regex can certainly validate a game of tennis. But before proceeding, it's worth mentioning that Regex has no built-in support for permutations. That sometimes makes it the wrong tool for jobs like this one, which may have rather long solutions as a consequence. Solution sizes are exponential in the alphabet size, to be exact. Our alphabet has just two letters, so we'll persevere for now.

The examples given so far represent full, legal games. Illegal examples include such things as A, AA, AAA (these games are still in progress), or AAAAB (player A has already won the game before B "wins" that impossible final fifth point). In fact, when taken together with their opposites, i.e. the corresponding cases where B wins instead of A, the foregoing 15 cases already exhaust all 30 possibilities for what we'll call a short game (up to six points). So, by disjoining ( | ) all of the foregoing examples and their opposites, then forcing a full game match by delimiting the result with ^ and \$, we can obtain a canonical pattern capable of validating all short games:
^(AAAA|BBBB|BAAAA|ABBBB|ABAAA|BABBB|...|AAABBA|BBBAAB)\$
But we can also do a lot better. Here is a rough sketch of the state transition diagram for an arbitrary game. For simplicity, tiebreak games are excluded from this treatment, but aside from the game length, their analysis is essentially the same. Scoring on the diagram proceeds from left to right, except when forced back (arrows) from Adv_A or Adv_B to Deuce. The transition is upward whenever A wins the point, downward when B.

Our so-called short games correspond to all the valid paths through this diagram, from Start to Win_A or Win_B, and avoiding Deuce. It so happens that the analysis used here is better explained in a later part of this problem, so for now I'll just pull out of my hat this improved pattern, which matches all 30 (and only those) short games:
(AAAB?B?A|(AAB|ABA|BAA)(AB?A|BAA)|(ABB|BAB|BBA)AAA
|BBBA?A?B|(BBA|BAB|ABB)(BA?B|ABB)|(BAA|ABA|AAB)BBB)
Next we address the remaining long games. After six equally shared points, the score is "40-40", aka "Deuce". For either player to win the game from this point, she must score a further two consecutive points, making it therefore a game of 8 (or 10, or 12, ...) points in total. Now, applying binomials again, there are "six choose three" = twenty ways to reach this intermediate Deuce state, beginning at Start. This portion of the game history comprises 3 As and 3 Bs, intermixed in any one of these 20 possible sequences.

Divide and Conquer!

Cut the play in half. If there are 3 As in the first half, then there must be 3 Bs in the second:
AAABBB
Or if there are 2 As and a B in the first half, then the second must comprise some combination of one more A and 2 Bs:
(AAB|ABA|BAA)(ABB|BAB|BBA)
And so on. There are only four such partitions, and when we gather them together in a disjunction, we obtain the pattern matching any sequence of play from Start to Deuce:
(AAABBB|(AAB|ABA|BAA)(ABB|BAB|BBA)
|BBBAAA|(ABB|BAB|BBA)(AAB|ABA|BAA))
The long game pattern is completed by tacking on to this stem the playoff stage, in which the winner concludes with two consecutive points. This is simply any number (possibly zero) of AB or BA pairs, followed by a final AA or BB. Converting these words into symbols:
(AB|BA)*(AA|BB)
To obtain the final Regex pattern of a tennis game, take the disjunction of the patterns above for matching short and long games:
(AAAB?B?A|(AAB|ABA|BAA)(AB?A|BAA)|(ABB|BAB|BBA)AAA
|BBBA?A?B|(BBA|BAB|ABB)(BA?B|ABB)|(BAA|ABA|AAB)BBB|
(AAABBB|(AAB|ABA|BAA)(ABB|BAB|BBA)
|BBBAAA|(ABB|BAB|BBA)(AAB|ABA|BAA))(AB|BA)*(AA|BB))
An Alternative Approach

The above is a complete solution, and despite the pattern lengths involved, still a practical one, since these patterns or equivalents can easily be autogenerated. However, the autogeneration process is exactly equivalent to walking through the state diagram. If we can do that, then we already have a finite state machine capable of game history validation.

Class TennisGame below is one example of such a state machine, encoding the scoring rules of tennis games. It has a Play method which accepts one string parameter purporting to be a game history, and returns the validation result as a boolean. Any two adjacent characters can be used for the alphabet, so for example, a game history can be written as ABAAA, or equivalently, as "10111".
`public enum Point{  Love, Fifteen, Thirty, Forty, Advantage, Win}public class TennisGame{  public bool Legal { get; private set; }  public Point[] Score { get; private set; }  public TennisGame()  {    Score = new Point[2];    Reset();  }  public void Reset()  {    Score[0] = Score[1] = Point.Love;    Legal = true;  }  public bool Play(string points)  {    Reset();    foreach (var point in points)      WinPoint(point & 1);    return Legal && GameOver;  }  private void WinPoint(int player)  {    if (GameOver)      Legal = false;    switch (Score[player])    {      case Point.Love:      case Point.Fifteen:      case Point.Thirty:        Score[player]++;        break;      case Point.Forty:        switch (Score[1 - player])        {          case Point.Love:          case Point.Fifteen:          case Point.Thirty:            Score[player] = Point.Win;            break;          case Point.Forty:            Score[player] = Point.Advantage;            break;          case Point.Advantage:            Score[1 - player] = Point.Forty;            break;          default:            Legal = false;            break;        }        break;      case Point.Advantage:        Score[player] = Point.Win;        break;      default:        Legal = false;        break;    }  }  public bool GameOver  {    get { return Score[0] == Point.Win || Score[1] == Point.Win; }  }}`
These two approaches are essentially equivalent in terms of not just the infinite set of game histories they'll validate, but also in runtime behaviour. That's because .NET converts regular expressions into finite state machines prior to execution (either at runtime, or when compiled with the RegexOptions.Compiled flag). It is left as an exercise for the student to extend both of these approaches to cover whole sets, matches and tournaments, with and without tiebreakers!

1. The number of n-point games, g(n), is:

g(n) = 2 * C(n-1, n-4), 4 <= n <= 6;
g(n) = 5 * 2^(n/2 - 1), 6 <= n, n even;
g(n) = 0 otherwise,

where C(n, k) is the binomial coefficient "n choose k", and "^" denotes exponentiation.

n : g(n)
=====
4 : 2
5 : 8
6 : 20
7 : 0
8 : 40
9 : 0
10 : 80
11 : 0
12 : 160
13 : 0
14 : 320

Added as a comment because the Blogger WYSIWYG editor destroys the format of my indented code upon every save, and in just such a surprising way as to make Very Challenging the task of pre-stressing it, so that it collapses into the desired state.

2. I can remember trying to solve this problem in first year of Comp. Sci :) brought back a lot of memories :)