I've decided to listen to 129 classical symphonies and rate them against each other, to try to understand this musical format a little better. The project has two immediate side effects of a technical nature. I've had to:
- Replace the 32GB SDHC card on my
phoneMP3 player, a venerable Sony Ericsson Xperia X10 Mini, with the bleeding edge Sandisk 64GB SDXC, to make room for a shedload of symphonies; and - Brush up on greedy quantifiers, and their lazier, more reluctant counterparts.
The latter side effect occurs because I've arrived at my list of 129 symphonies under test by combining the main table of 120 listed at Digital Dream Door,
http://www.digitaldreamdoor.com/pages/best-classic-symp.htmlwith a further 9 also-rans suggested by that forum's moderator, and compiler of the main list, whom we shall call Brian (for that is his name). Having screen scraped his raw data, converting all its en–dashes and “directional quotes” to hy-phens and "neutrals", I was left with the task of parsing these results. Here's a small sample:
1. Symphony No. 9 in D minor "Choral" - Ludwig Van BeethovenFor analysis, I need to extract certain data from these entries. The first, dot-terminated field is Brian's initial rank. This is followed by the full title of the symphony, which usually includes its performance key, and sometimes a popular alternate title or nickname. Finally, a hyphen sets off the composer name.
2. Symphony No. 5 in C minor - Ludwig Van Beethoven
11. Symphonie Fantastique - Hector Berlioz
63. Symphony No. 3 "The Camp Meeting" - Charles Ives
Now, it would have been easy to take multiple passes at this task, first removing any optional (key and nickname) fields present, then splitting the remaining text into the mandatory rank, root title and composer name fields. Too easy, in fact. Why do that, when we could just use a single Regex pattern to simultaneously extract the full compliment of fields, mandatory and optional?
Too Greedy
The difficulty is that the pattern capturing the symphony title could also gobble up any key and/or nickname information that might be present. Let's say for simplicity that the key field can be identified by the word in, while the nickname is delimited by double quotation marks. That gives us two optional groups:
where we've nested the capturing key group (underlined) in a non-capturing group (?: )? in order to exclude the in keyword itself from the captured data. Let's try dropping these into our first attempt at a mandatory field template:(?: in (.+?))? (".+")?
This succeeds in extracting the leftmost rank and rightmost composer name fields correctly. However, there's no key or nickname information parsed out; all of that remains resolutely embedded in the title field. We say that the + quantifier in the title group (.+) is greedy. It gobbles up as many input characters as possible, backtracking only when necessary to find an overall pattern match. If it can devour the key and nickname information without breaking the match, as here it can because those groups are optional, then it will.(\d+)\. (.+)(?: in (.+?))?(".+")? - (.+)
Dieting
Quantifiers like ?, *, + and {m,n} can be made less greedy by following them with a question mark. For example, the ? quantifier on its own matches zero or one occurrence(s) of the preceding expression, but with a preference for one. That's to say, it will gobble up input characters whenever it can. ?? also matches zero or one occurrence(s) of the expression immediately before it, but with a preference for zero. In general, these so-called lazy quantifiers consume as few input characters as possible, while still letting the overall pattern match succeed. Which turns out to be exactly what's needed in our case:
Now the title group (.+?) consumes the minimum possible number of characters, meaning that if there are key or nickname fields present, the associated groups will now have a higher priority in extracting those from the input stream. Similarly, the identical key group (.+?) has itself been put on a diet, to stop it gobbling up any present nickname.(\d+)\. (.+?)(?: in (.+?))?(".+")? - (.+)
The final pattern that I used contained many further refinements, including named groups for readability, generalisation of white space, trimming of field contents, and stricter key field matching with a more specific group pattern,
which actually removes the need for the second lazy quantifier above. Incidentally this also allows the word in to appear in the symphony's title in a non-key, non-nickname context - as for example in Igor Stravinsky's Symphony in Three Movements, which happens to be number 76 on Brian's list.[A-G](?: sharp| flat)?(?: major| minor)?
Further Reading, Listening
Greedy and lazy Regex quantifiers are explained in more detail at MSDN:
http://msdn.microsoft.com/en-us/library/3206d374.aspxA good online tool for testing regular expressions containing multiple groups is Derek Slager's AJAX offering, A Better .NET Regular Expression Tester, which can be found here:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashxAs for the listening exercise, a quick survey of my dusty old Classic FM CDs reveals that I already have good recordings of
No comments:
Post a Comment