Wednesday, 6 April 2011

Simple Regex #3: A Failure Case

No Back Tracking

I was processing a number of Visual Studio files. A large-ish number of C# solution, project, and source files. Looking for any obsolete (or otherwise redundant) .cs file which wasn't included in a .csproj - or worse, any .csproj that wasn't explicitly built by a .sln. Now, some of these projects referred to files not in the directory subtree of the project file itself. Also, some such references employed relative paths.

For example, if the project file at

d:\project\client\startup.csproj
refers to the source file at

d:\project\engine\utils.cs
then it's likely to do so using two dots; i.e., under the guise of

..\engine\utils.cs
Obviously, appending this relative path to the client directory path

d:\project\client\
yields a perfectly serviceable, though not optimal, source file path:

d:\project\client\..\engine\utils.cs
Trouble is, my earlier directory traversal, building up the dictionary of source files, created the entry for this file using the "key":

d:\project\engine\utils.cs
So in order to normalize keys, I needed to remove the \client\.. portion of the calculated path, making these two identical. Sounds like an ideal job for a wee bit of simple Regex, no? Just remove all instances of a backslash, followed by a bunch of non-backslash characters (the redundant directory name), followed by another backslash and two periods:
return Regex.Replace(input, @"\\[^\\]+\\\.\.", string.Empty); // Bug!
No!

I puzzled for a little while over why the output of this function still contained embedded \..\, before finally adding some debug code and discovering the answer.

Before:

d:\project\client\forms\..\..\version.cs
After:

d:\project\client\..\version.cs
Why hasn't the above code removed \client\.. from this path? Because \client\.. wasn't in the original input string. When it does appear, the Regex.Replace operation (having just removed \forms\..) has already moved beyond that position in processing the input string. And it does not backtrack. Can't afford to, that would be mayhem. Imagine it trying to replace "x" by "xx" with backtracking.

~(∀x)(x is a nail)

I could work around the bug, say by detecting whether any change has been made to the input string, and if so (or rather, while so) re-running the operation. That's clearly inefficient. No, the best thing to do here is accept I've selected the wrong tool for the job - a conclusion you'll reach fairly often with Regex!

Here finally is the smart approach to this problem:
return System.IO.Path.GetFullPath(input); // Fixed.

5 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. If you want to see the regex, check the solution that I have posted in the Regex Google Group.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. For clarity: it's the commenters who are removing their own posts here, not me!

    ReplyDelete