Wednesday, 5 September 2012

Simple Regex #6: Lazy Quantifiers

Dedicated Listener

I've decided to listen to 129 classical symphonies and rate them against each other, to try to understand this musical format a little better. The project has two immediate side effects of a technical nature. I've had to:
  1. Replace the 32GB SDHC card on my phone MP3 player, a venerable Sony Ericsson Xperia X10 Mini, with the bleeding edge Sandisk 64GB SDXC, to make room for a shedload of symphonies; and
  2. Brush up on greedy quantifiers, and their lazier, more reluctant counterparts.
The former was a pleasant surprise. The phone manufacturer's specifications have always claimed microSDHC compatibility "up to 16GB", but I'd been using a 32GB card ever since the X10 Mini got shunted out of my wife's handbag by the arrival of a bouncing new baby Galaxy. Still, the jump to purchase the XC card was a bit of a leap of faith, since nobody else online appeared to have tried this particular combo. Sadly, it worked! Yes sadly, because otherwise, I'd have had a great case for a new phone...

The latter side effect occurs because I've arrived at my list of 129 symphonies under test by combining the main table of 120 listed at Digital Dream Door,
with a further 9 also-rans suggested by that forum's moderator, and compiler of the main list, whom we shall call Brian (for that is his name). Having screen scraped his raw data, converting all its en–dashes and “directional quotes” to hy-phens and "neutrals", I was left with the task of parsing these results. Here's a small sample:
 1. Symphony No. 9 in D minor "Choral" - Ludwig Van Beethoven
 2. Symphony No. 5 in C minor - Ludwig Van Beethoven
11. Symphonie Fantastique - Hector Berlioz
63. Symphony No. 3 "The Camp Meeting" - Charles Ives
For analysis, I need to extract certain data from these entries. The first, dot-terminated field is Brian's initial rank. This is followed by the full title of the symphony, which usually includes its performance key, and sometimes a popular alternate title or nickname. Finally, a hyphen sets off the composer name.

Now, it would have been easy to take multiple passes at this task, first removing any optional (key and nickname) fields present, then splitting the remaining text into the mandatory rank, root title and composer name fields. Too easy, in fact. Why do that, when we could just use a single Regex pattern to simultaneously extract the full compliment of fields, mandatory and optional?

Too Greedy

The difficulty is that the pattern capturing the symphony title could also gobble up any key and/or nickname information that might be present. Let's say for simplicity that the key field can be identified by the word in, while the nickname is delimited by double quotation marks. That gives us two optional groups:
(?: in (.+?))?
where we've nested the capturing key group (underlined) in a non-capturing group (?: )? in order to exclude the in keyword itself from the captured data. Let's try dropping these into our first attempt at a mandatory field template:
(\d+)\. (.+)(?: in (.+?))?(".+")? - (.+)
This succeeds in extracting the leftmost rank and rightmost composer name fields correctly. However, there's no key or nickname information parsed out; all of that remains resolutely embedded in the title field. We say that the + quantifier in the title group (.+) is greedy. It gobbles up as many input characters as possible, backtracking only when necessary to find an overall pattern match. If it can devour the key and nickname information without breaking the match, as here it can because those groups are optional, then it will.


Quantifiers like ?, *, + and {m,n} can be made less greedy by following them with a question mark. For example, the ? quantifier on its own matches zero or one occurrence(s) of the preceding expression, but with a preference for one. That's to say, it will gobble up input characters whenever it can. ?? also matches zero or one occurrence(s) of the expression immediately before it, but with a preference for zero. In general, these so-called lazy quantifiers consume as few input characters as possible, while still letting the overall pattern match succeed. Which turns out to be exactly what's needed in our case:
(\d+)\. (.+?)(?: in (.+?))?(".+")? - (.+)
Now the title group (.+?) consumes the minimum possible number of characters, meaning that if there are key or nickname fields present, the associated groups will now have a higher priority in extracting those from the input stream. Similarly, the identical key group (.+?) has itself been put on a diet, to stop it gobbling up any present nickname.

The final pattern that I used contained many further refinements, including named groups for readability, generalisation of white space, trimming of field contents, and stricter key field matching with a more specific group pattern,
[A-G](?: sharp| flat)?(?: major| minor)?
which actually removes the need for the second lazy quantifier above. Incidentally this also allows the word in to appear in the symphony's title in a non-key, non-nickname context - as for example in Igor Stravinsky's Symphony in Three Movements, which happens to be number 76 on Brian's list.

Further Reading, Listening

Greedy and lazy Regex quantifiers are explained in more detail at MSDN:
A good online tool for testing regular expressions containing multiple groups is Derek Slager's AJAX offering, A Better .NET Regular Expression Tester, which can be found here:
As for the listening exercise, a quick survey of my dusty old Classic FM CDs reveals that I already have good recordings of 43 56 symphonies on Brian's list, including all of his top 20 25 (and quite a few duplicates; who authorised that?). So cost wise, I'm exactly one third almost half complete. Although to be honest, I've never really liked Solti's limp wristed walkthrough of Mozart's Jupiter...

No comments:

Post a Comment