Thursday 20 September 2012

Simple Regex #6½: More Lazy Quantifiers

Hush. Hush.

The security-related components of my work continue to comprise only product-specific threats and mitigation, which means I can't exactly blog about them in a public forum like this one. Instead, here's a little more on the subject of that previous application of Regular Expressions to music catalogues.

Oh, and about that previous article: I have to be honest and say that I've been getting complaints! Apparently the level of explanation offered wasn't even up to my usual low standards of lucidity. Let's try to rectify that here. The goal, you'll remember, was to parse a list of classical music pieces like this,
49. Violin Concerto in E major, RV271 "L'amoroso" - Antonio Vivaldi
subject to the proviso that while an item's rank (here equal to 49), title (Violin Concerto) and composer name (Antonio Vivaldi) are all mandatory, the key (E major), opus/catalogue number (RV271) and nickname (L'amoroso) are all optional. Here's an analysis of the Regex pattern I'm using to split these records into fields:
private const string rank = @"(\d+)\.";
private const string title = @" (.+?)";
private const string key = @"(?: in ([A-G](?: flat| sharp)?(?: major| minor)?))?";
private const string number = @"(?:, (.+?))?";
private const string nickname = @"(?: ""(.+)"")?";
private const string composer = @" - (.+)";
private const string pattern = rank + title + key + number + nickname + composer;
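The @ symbol is an artifact of the C# language: it marks a verbatim string literal, in which a backslash is just a backslash. Most of its appearances above are redundant (only rank, with its backslashes, and nickname, with its doubled "" quotes, actually depend on it), but regardless, I do tend to use it habitually when working with Regex. It saves having to double all backslashes. So the first line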
private const string rank = @"(\d+)\.";
matches one or more decimal digits (followed by a literal period, which is outside the capturing parentheses, and so doesn't itself get included in the captured group). The second line
private const string title = @" (.+?)";
matches a space (again excluded, being outside the group) followed by one or more characters of the title, but matching as few characters as possible consistent with an overall successful match.
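To see what the lazy quantifier buys us, here is a minimal sketch that runs the sample record through the same field patterns twice, once with the lazy title and once with a greedy one. The class name LazyVersusGreedy and the helper TitleOf are my own inventions for illustration, and so is the greedy variant.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // Field patterns copied from the article; only the title pattern varies.
    private const string rank = @"(\d+)\.";
    private const string key = @"(?: in ([A-G](?: flat| sharp)?(?: major| minor)?))?";
    private const string number = @"(?:, (.+?))?";
    private const string nickname = @"(?: ""(.+)"")?";
    private const string composer = @" - (.+)";

    // Hypothetical helper: returns the captured title (group 2)
    // when the given title sub-pattern is slotted into the full pattern.
    public static string TitleOf(string titlePattern, string record)
    {
        string pattern = rank + titlePattern + key + number + nickname + composer;
        return Regex.Match(record, pattern).Groups[2].Value;
    }

    public static void Main()
    {
        const string record =
            "49. Violin Concerto in E major, RV271 \"L'amoroso\" - Antonio Vivaldi";

        // Lazy: the title stops as soon as the rest of the pattern can match.
        Console.WriteLine(TitleOf(@" (.+?)", record));
        // prints: Violin Concerto

        // Greedy: the title swallows the key, number and nickname as well,
        // because all three of those groups are optional.
        Console.WriteLine(TitleOf(@" (.+)", record));
        // prints: Violin Concerto in E major, RV271 "L'amoroso"
    }
}
```

Both versions match the record. The difference is only in where the title ends, which is exactly why the optional fields need a lazy title in front of them.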

It helps to apply these two visual filters when inspecting the various groups in patterns like the third line above:
(?: starts a non-capturing group;
)? ends an optional group.
So for example, the overall key group pattern above is both non-capturing, since it starts with (?:, and optional, since it ends with )?. Nested within it is the main capturing group (labelled c1 in the expanded analysis below) for the key text, and nested in turn within that are two further, non-capturing, optional groups, n1 and n2:
private const string n1 = @"(?: flat| sharp)?";
private const string n2 = @"(?: major| minor)?";
private const string c1 = @"([A-G]" + n1 + n2 + ")";
private const string key = @"(?: in " + c1 + ")?";
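As a quick check on that expansion, here is a small sketch that tries the key pattern on its own against a few fragments. The class KeyGroupDemo, the helper TryGetKey and the ^ and $ anchors are assumptions of mine, added only so the fragment has to match in full; the four constants are the ones above (with key made public so it can be inspected).

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyGroupDemo
{
    private const string n1 = @"(?: flat| sharp)?";
    private const string n2 = @"(?: major| minor)?";
    private const string c1 = @"([A-G]" + n1 + n2 + ")";
    public const string key = @"(?: in " + c1 + ")?";

    // Hypothetical helper: true only when the whole fragment matches
    // AND the capturing group c1 actually took part in the match.
    public static bool TryGetKey(string fragment, out string keyText)
    {
        Match m = Regex.Match(fragment, "^" + key + "$");
        bool captured = m.Success && m.Groups[1].Success;
        keyText = captured ? m.Groups[1].Value : "";
        return captured;
    }

    public static void Main()
    {
        // The expanded form concatenates back to the original one-liner.
        Console.WriteLine(
            key == @"(?: in ([A-G](?: flat| sharp)?(?: major| minor)?))?");
        // prints: True

        string[] fragments =
            { " in E major", " in B flat minor", " in C sharp", " in D", "", " in H major" };
        foreach (string fragment in fragments)
        {
            string keyText;
            bool found = TryGetKey(fragment, out keyText);
            Console.WriteLine("[" + fragment + "] -> " + (found ? keyText : "(no key)"));
        }
        // E major, B flat minor, C sharp and D are captured; the empty
        // fragment matches without a key; " in H major" does not match.
    }
}
```

Only c1 contributes a numbered group. The outer (?: ... )? wrapper and the two inner ones never show up in the match's Groups collection, which is the point of the (?: filter.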

More Music Maestro!

Hopefully that's as much analysis as we need for this pattern. It's a little more complex than previously, because of the addition of this opus/catalogue number field, appearing when I generalised the listening project, originally featuring just symphonies, to include also the concerti, symphonic poems and ballets in the following lists:
That should be enough to keep HMV in business for a few more weeks! I wanted the project to include all our classical favourites, so piano concerti became a necessity (Grieg for me, Tchaikovsky #1 for the wife), as did tone poems (The Planets, for both of us) and ballets (Bolero).

Anyway, the addition of an opus/catalogue number field seemed like a good idea. And obviously, being optional, it had to be added with yet another lazy quantifier. Equally lazy is my detection of the number field's existence, which is triggered by the presence of a comma after the title (and after the key, if any). It would be possible to do more, since this field does have a certain structure, although not as much as the key field. Noticing that only one record contained commas in the actual title,
106. Symphony No. 14 for soprano, bass, strings, and percussion - Dmitri Shostakovich
I just deleted them. Well sometimes, and particularly with Regex, the best solution is to get a life.
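For the curious, here is a rough sketch of what goes wrong with that record, and what deleting the commas does about it. The class CommaPitfall and the helper Describe are hypothetical names of mine, and the record is written with the plain " - " separator that the pattern expects.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaPitfall
{
    // The article's field patterns, unchanged.
    private const string rank = @"(\d+)\.";
    private const string title = @" (.+?)";
    private const string key = @"(?: in ([A-G](?: flat| sharp)?(?: major| minor)?))?";
    private const string number = @"(?:, (.+?))?";
    private const string nickname = @"(?: ""(.+)"")?";
    private const string composer = @" - (.+)";
    private const string pattern = rank + title + key + number + nickname + composer;

    // Hypothetical helper: reports the title (group 2) and number (group 4).
    public static string Describe(string record)
    {
        Match m = Regex.Match(record, pattern);
        if (!m.Success) return "no match";
        return "title=[" + m.Groups[2].Value + "] number=[" + m.Groups[4].Value + "]";
    }

    public static void Main()
    {
        const string withCommas =
            "106. Symphony No. 14 for soprano, bass, strings, and percussion - Dmitri Shostakovich";

        // The first comma is taken as the start of the number field.
        Console.WriteLine(Describe(withCommas));
        // prints: title=[Symphony No. 14 for soprano] number=[bass, strings, and percussion]

        // With the commas deleted, the whole phrase stays in the title.
        Console.WriteLine(Describe(withCommas.Replace(",", "")));
        // prints: title=[Symphony No. 14 for soprano bass strings and percussion] number=[]
    }
}
```

A stricter number pattern could tell "RV271" from "bass, strings, and percussion", but for one record out of the whole list, the delete key is cheaper.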
