Hardly a day goes by without someone asking me a question about regular expressions. Scale This! Chris is one such customer, although not one of my regulars, so to speak. Recently, Chris has also started asking me to explain, in layman's terms, such things as special and general relativity, and quantum mechanics. Why me?! I'm certainly no relativity guru, and could just as easily imagine asking him, or almost anyone else for that mass-energy, to explain it just once more to me!
But on the subject of quantum mechanics, I guess the day is approaching when those of us presently engaged in high-level control of electron flow (software writers, in other words) will have to get with the quantum program, or get a new career in retail greeting. With that in mind, I've just finished reading Quantum Computation and Quantum Information by Michael A. Nielsen (University of Queensland) and Isaac L. Chuang (IBM / Stanford University). Since buying this book it's taken me six years to plow through its 675 pages, but I think in this field at least, I might at last be suitably qualified to discuss just the very basics.
Warning! Metaphorical trainwreckage ahead!
This coming apocalypse will be the greatest ever sea change in the history of practical computing. One thing that might help us weather the storm, would be a source of analogies and comparisons to known references. It was a total blast recently to find a beautiful example of just such an analogy, hiding in plain sight, in the formal language subject of regular expression parsing.
The Go Guy
Russ Cox, one of the people behind Google's new programming language Go - and apparently played here by the famous mathematical genius Will Hunting - has just this week completed the third article in his (hopefully!) three part series on the history, theory, and programming practice of regular expressions. Parts one and two have been available since Jan 2007 and Dec 2009 respectively; and inevitably, the sudden and unexpected appearance of part three has forced me to revisit those earlier articles, to refresh the old volatile storage units.
So here's the deal. You've probably heard that unlike classical "bits", each of which is at any given time in either its '0' state or its '1' state, the unit of quantum computation, the "qubit", can somehow be in more than one state at a time. It's said to be in a superposition of states, which is a kind of blurry combination of 0 and 1. It's only when we make a measurement, aka an observation, also what we may now begin calling a computation, that the quantum state or wave function "collapses" into either 0 or 1.
How does this help us compute stuff? Well, it all has to do with the preparation of the initial state of the qubit. I'll be going into that in quite a lot of detail in another article soon, but for now, just know that we prepare a superposition state of qubits, then later we read the evolved state.
So, what could all of this possibly have to do with regular expressions? The connective tissue is of course this idea of state.
Regular expression parsers are best modelled by finite automata: machines with states. Sometimes these are deterministic; given the current state and the next input character, we can always determine uniquely the new state of the machine. Sometimes they are not. It's these non-deterministic finite automata, or NFAs, that interest us here.
Russ's first example uses the pattern A(BB)+A, which matches any input string containing an A, followed by one or more pairs of Bs, then finally terminated by another A. This machine is deterministic; as we scan the input string, say ABBBBA, one character at a time, we move from our starting state, into a found A state, then a found AB state (i.e., found an A followed by an odd number of Bs), then found ABB (an even number of Bs), then back to found AB, forward again to found ABB, then at last upon reading the terminating A from the input string, we end up in a final success - match found state.
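That walkthrough can be sketched as a plain transition table. This is just my own toy illustration of the machine described above, not code from Russ's articles; the state names (found_A and friends) are mine.

```python
# Transition table for the deterministic machine matching A(BB)+A.
# Any (state, character) pair not listed leads to a dead "fail" state.
TRANSITIONS = {
    ("start",     "A"): "found_A",
    ("found_A",   "B"): "found_AB",    # an A, then an odd number of Bs
    ("found_AB",  "B"): "found_ABB",   # an A, then an even number of Bs
    ("found_ABB", "B"): "found_AB",
    ("found_ABB", "A"): "match",       # final success - match found
}

def matches(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch), "fail")
        if state == "fail":
            return False
    return state == "match"
```

Feeding it ABBBBA walks exactly the route described above: found_A, found_AB, found_ABB, found_AB, found_ABB, match.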
Whenever we find that the next state is not uniquely determined by the combination of the current state and the next input character, we can place markers in the pattern and the input string, make an arbitrary selection of the new state, and proceed just as before to see if we reach a match. If we fail, then we can use these markers to backtrack to the earlier decision point, make an alternative choice of new state, and retry. Eventually we'll find a match, or run out of input, or else we'll have tried every available route through the labyrinth, without reaching that final success - match found state.
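Here's a minimal sketch of that marker-and-backtrack idea, supporting only literal characters and the ? operator (which is all we'll need for the pathological pattern below). Again, the function is my own toy, not a real engine: the recursion stack plays the role of the markers, and the second recursive call is the "alternative choice of new state".

```python
# Toy backtracking matcher: literals plus the '?' (optional) operator only.
def backtrack_match(pattern, text):
    if not pattern:
        return not text                      # match only if input is exhausted too
    if len(pattern) > 1 and pattern[1] == "?":
        # Decision point: first try consuming the optional character...
        if text and text[0] == pattern[0] and backtrack_match(pattern[2:], text[1:]):
            return True
        # ...and on failure, backtrack and try skipping it instead.
        return backtrack_match(pattern[2:], text)
    # A mandatory literal: it must match the next input character.
    return bool(text) and text[0] == pattern[0] and backtrack_match(pattern[1:], text[1:])
```

Every failed branch unwinds to the most recent decision point and retries, which is exactly the behaviour that turns pathological on the wrong inputs.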
From a computational viewpoint, it's this backtracking that's the problem. It can lead to your regex matching code taking time exponential in the length of the input to decide match or no match.
In 1968 Ken Thompson (of "Thompson and Ritchie" Unix fame), in his CTSS text editor QED, introduced programmers to a superior theoretical approach. And when I say superior, well, just look at Russ's graphical comparison of execution time against x, where the test pattern is A?^xA^x and the input string is A^x (the superscript x denoting x repetitions). For the absolute avoidance of doubt: when x=3, the pattern is A?A?A?AAA, and the input string is AAA.
The upper, red curve represents the worst case performance of a backtracking algorithm, of the type which is still widely used today, for example in Java, Perl, PHP, Python, and Ruby. The lower, blue curve represents Ken Thompson's NFA simulation approach. And why didn't Russ plot them on the same graph? Look at the vertical axes. One is in seconds, the other in microseconds.
So that's settled then: we should all be using the Thompson NFA in our regex implementations. But just before we do, exactly what is its secret anyway? How does it avoid the exponential explosion?
The secret is to do away with backtracking entirely. Whenever we come to a fork in the road, and we can't guess which turn may lead to a successful match, we take all available turns simultaneously. In other words, we choose the new state to be the set of all available, possible new states. Then the simple, plodding algorithm proceeds as before, but now effectively following multiple threads at once. What matters is simply whether any one of these threads reaches the final success - match found state, or whether we reach a total mismatch blockage, or run out of input.
Multiple new states may coalesce back into one, split again, and so on. But at any given point, the set of current states represents everywhere in state machine space that we could have legally reached, given the input so far. Exactly how we got there is unimportant; we only care that there was a way. So we're not actually following a bunch of threads. The threads have dissolved. All we ever have is that set of current states.
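To make that concrete, here's the same toy pattern language (literals plus ?) matched Thompson-style: instead of recursing and backtracking, we carry a set of positions in the pattern, advancing every member of the set on each input character. It's a sketch under the same assumptions as before (my own names, a deliberately tiny feature set), but it shows the coalescing: dead ends simply drop out of the set, and no path is ever retried.

```python
# Thompson-style simulation: track the SET of reachable pattern positions.
def tokenize(pattern):
    # "A?A?AA" -> [("A", True), ("A", True), ("A", False), ("A", False)]
    tokens, i = [], 0
    while i < len(pattern):
        optional = i + 1 < len(pattern) and pattern[i + 1] == "?"
        tokens.append((pattern[i], optional))
        i += 2 if optional else 1
    return tokens

def closure(states, tokens):
    # An optional token may be skipped without consuming input,
    # so each state also implies the states reachable by skipping.
    result, stack = set(), list(states)
    while stack:
        s = stack.pop()
        if s in result:
            continue
        result.add(s)
        if s < len(tokens) and tokens[s][1]:
            stack.append(s + 1)
    return result

def nfa_match(pattern, text):
    tokens = tokenize(pattern)
    current = closure({0}, tokens)
    for ch in text:
        # Advance every live state on ch; mismatches just vanish from the set.
        current = closure({s + 1 for s in current
                           if s < len(tokens) and tokens[s][0] == ch}, tokens)
    return len(tokens) in current    # did ANY path reach the success state?
```

Each input character costs at most one pass over a set whose size is bounded by the pattern length, so the whole match runs in linear time in the input: no exponential explosion, because nothing is ever tried twice.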
And then, if one of these states does eventually transition to the final success - match found state, the bubble bursts; the wave function collapses; an experimental determination is made.
Go On! You're Having A Laugh!
Of course, this is nothing more than an analogy, designed to help us get our heads around one particular aspect of quantum computation. As soon as you begin to push it, it breaks down. More generally, all attempts to simulate the operation of the quantum realm using classical systems are doomed to failure and disappointment, as I'll be recounting soon in the forthcoming review of the aforementioned ten-year-old book that took me six years to read (note: if you change "quantum" to "parallel", then the new analogy holds much more water). But as an adjunct to a layman's introduction to the spooky realm, I hope that like me, you will find it quite apposite.
If you're particularly interested in his (beautiful!) coverage of regular expressions, here are the links to Russ's entire series:
- Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)
- Regular Expression Matching: the Virtual Machine Approach
- Regular Expression Matching in the Wild