Thursday, 27 January 2011

Simple Regex #1: Metacharacters

Useful Concepts: Opacity

Oh sure, I know there's no shortage of guides to regular expression on these Internets. However, questions about them continue to land on my desk with a satisfyingly fitting regularity. So I've decided to write about them, starting with this one. Colleague C wants to parse some data from Discogs, the music database / marketplace. Credits on individual tracks follow a comma separated format like this:
Piano, Guitars [Lead, Bass, 12 String Acoustic], Drums, Kitchen Sink
followed by the artist's name. What C wants to do, is to parse each individual instrument or grouping of instruments, ending up with something like this:
Piano
Guitars [Lead, Bass, 12 String Acoustic]
Drums
Kitchen Sink
Obviously the "wrinkle" is that bracketed sections can contain commas, which we don't want to act as field separators. In fact (and this is key to the answer), we don't care at all what's inside the bracketed areas. Everything after a left bracket, up to and including the first right bracket if any, should be preserved just as it is. Such is frequently the situation with string encodings; separators and terminators are implicitly "escaped" when parenthesised. We may also term such bracketed content "opaque", meaning that our parser "cannot see inside it" - to discern, for example, those commas.

My First Metacharacter

That being the case, we might as well treat an entire bracketed expression as a single, inscrutable and indivisible (meta)character. One example of a pattern matching such an expression is:
\[[^\]]*]
which is to say, a literal left bracket "[", followed by any number of non-"]" characters (the sub pattern for such a sequence is shown underlined), followed by a final right bracket "]". For the sake of clarity and language independence I'll be omitting pattern delimiters in most of my examples; you should add your own enclosing @"" or // as appropriate for your environment. Now we can say that each match in our output is a sequence of atoms, where every atom is one or other of these:
  • any single character, which is neither a comma, nor a left bracket;
  • a bracketed expression, as recently defined above.
Here then is the pattern that matches one of our newly identified atoms:

[^,[]|\[[^\]]*]
Note the use of the OR operator "|" between the two underlined sub patterns for matching (1) a single character, and (2) a bracketed expression. Now, a match can be expressed as a sequence comprising any number of atoms:
([^,[]|\[[^\]]*])*
Here the underline spans a single atom. If we now test this solution, we find that it has two bugs. It intersperses zero-length matches between the desired ones, and it preserves spaces after commas in the input string, manifesting these as leading spaces in the second and subsequent output matches. Both issues can be addressed by insisting that a match additionally starts with at least one non-comma, non-space character (prepended and underlined below):
[^, ]([^,[]|\[[^\]]*])*
And that, assuming only that bracketed expressions aren't nested, is one final answer to the case of the Discogs Parsing Regex.

Previously: Regex Fuzzing; Quantum Regex.

No comments:

Post a Comment