The Go To GuyHardly a day goes by without someone asking me a question about regular expressions.
Scale This! Chris is one such customer, although not one of my
regulars, so to speak. Recently, Chris has also started asking me to explain, in layman's terms, such things as special and general relativity, and quantum mechanics.
Why me?! I'm certainly no relativity guru, and could just as easily imagine asking
him, or almost anyone else for that mass-energy, to explain it just once more to
me!But on the subject of quantum mechanics, I guess the day is approaching, when those of us presently engaged in high level control of electron flow (software writers, in other words) will have to get with the quantum program, or get a new career in retail greeting. With that in mind, I've just finished reading
Quantum Computation and Quantum Information by Michael A. Nielsen (University of Queensland) and Isaac L. Chuang (IBM / Stanford University). Since buying this book it's taken me six years to plow through its 675 pages, but I think in this field at least, I might at last be suitably qualified to discuss just the very basics.
Warning! Metaphorical trainwreckage ahead!This coming apocalypse will be the greatest ever sea change in the history of practical computing. One thing that might help us weather the storm, would be a source of analogies and comparisons to known references. It was a total blast recently to find a beautiful example of just such an analogy, hiding in plain sight, in the formal language subject of regular expression parsing.
All clear!The Go GuyRuss Cox, one of the people behind Google's new programming language
Go - and apparently played here by the famous mathematical genius Will Hunting - has just this week completed the third article in his (hopefully!) three part series on the history, theory, and programming practice of regular expressions. Parts
one and
two have been available since Jan 2007 and Dec 2009 respectively; and inevitably, the sudden and unexpected appearance of part
three has forced me to revisit those earlier articles, to refresh the old volatile storage units.
So here's the deal. You've probably heard that unlike classical "bits", each of which is at any given time in either its '0' state or its '1' state, the unit of quantum computation, the "qubit", can somehow be in more than one state at a time. It's said to be in a
superposition of states, which is a kind of blurry combination of 0 and 1. It's only when we make a
measurement, aka an
observation, also what we may now begin calling a
computation, that the quantum state or
wave function "collapses" into either 0 or 1.
How does this help us compute stuff? Well, it all has to do with the
preparation of the initial state of the qubit. I'll be going into that in quite a lot of detail in another article soon, but for now, just know that we
prepare a superposition state of qubits, then later we
read the evolved state.
So, what could all of this possibly have to do with regular expressions? The connecting tissue is of course this idea of
state.
Going BackRegular expression parsers are best modelled by finite automata:
machines with
states. Sometimes these are
deterministic; given the
current state and the
next input character, we can always determine uniquely the
new state of the machine. Sometimes they are not. It's these
non-deterministic finite automata, or NFAs, that interest us here.
Russ's first example uses the pattern
A(BB)+A, which matches any input string containing an
A, followed by one or more pairs of
Bs, then finally terminated by another
A. This machine is deterministic; as we scan the input string, say
ABBBBA, one character at a time, we move from our
starting state, into a
found A state, then a
found AB state (i.e., found an
A followed by an odd number of
Bs), then
found ABB (an even number of
Bs), then back to
found AB, forward again to
found ABB, then at last upon reading the terminating
A from the input string, we end up in a final
success - match found state.
Whenever we find that the next state is
not uniquely determined by the combination of the current state and the next input character, we can place markers in the pattern and the input string, make an arbitrary selection of the new state, and proceed just as before to see if we reach a match. If we fail, then we can use these markers to
backtrack to the earlier decision point, make an alternative choice of new state, and retry. Eventually we'll find a match, or run out of input, or else we'll have tried every available route through the labyrinth, without reaching that final
success - match found state.
Go Faster!From a computational viewpoint, it's this backtracking that's the problem. It can lead to your regex matching code taking an exponential amount of time to decide
match or
no match.
In 1968 Ken Thompson (of "Thompson and Ritchie"
Unix fame), in his CTSS text editor QED, introduced programmers to a superior theoretical approach. And when I say superior, well, just look at Russ's graphical comparison of execution time against
x, where the test pattern (superscripts representing repetition) is
A?xAx, and the input string is
Ax. For the absolute avoidance of doubt: when
x=3, the pattern is
A?A?A?AAA, and the input string is
AAA.
The upper, red curve represents the worst case performance of a backtracking algorithm, of the type which is still widely used today, for example in Java, Perl, PHP, Python, and Ruby. The lower, blue curve represents Ken Thompson's
Nondeterministic Finite Automata. And why didn't Russ plot them on the same graph? Look at the vertical axes. One is in seconds, the other microseconds.
So that's settled then, we should all be using Thompson NFA in our Regex implementations. But just before we do, exactly what is its secret anyway? How does it avoid the exponential explosion?
Go QuantumThe secret is to do away with backtracking entirely. Whenever we come to a fork in the road, and we can't guess which turn may lead to a successful match, we take
all available turns simultaneously. In other words, we choose the new state to be the set of all available, possible new states. Then the simple, plodding algorithm proceeds as before, but now effectively following multiple threads at once. What matters is simply whether any one of these threads reaches the final
success - match found state, or whether we reach a total mismatch blockage, or run out of input.
Multiple new states may coalesce back into one, split again, and so on. But at any given point, the
set of current states represents
everywhere in state machine space that we could have legally reached, given the input so far. Exactly how we got there is unimportant; we only care that
there was a way. So we're not actually following a bunch of threads. The threads have dissolved. All we ever have is that set of current states.
And then, if one of these states does eventually transition to the final
success - match found state, the bubble bursts; the wave function collapses; an experimental determination is made.
Go On! You're Having A Laugh!Of course, this is nothing more than an analogy, designed to help us get our heads around one particular aspect of quantum computation. As soon as you begin to push it, it breaks down. More generally, all attempts to simulate the operation of the quantum realm using classical systems are doomed to failure and disappointment, as I'll be recounting soon in the
forthcoming review of the forementioned 10 year old book that took me 6 years to read
(note: if you change "quantum" to "parallel", then the new analogy holds much more water). But as an adjunct to a layman's introduction to the spooky realm, I hope that like me, you will find it quite apposite.
If you're particularly interested in his (beautiful!) coverage of regular expressions, here are the links to Russ's entire series:
- Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)
- Regular Expression Matching: the Virtual Machine Approach
- Regular Expression Matching in the Wild
Only the
first third of the
first article is needed to illustrate the preceding ideas. So like I say, if you think you could benefit from another diagram or two, click through right now.