Monday, 21 February 2011

Security Testing Part 2 - Lab Work

Jarlsberg Turns to Gruyere

As of July 13, 2010, the Google Codelab formerly known as Jarlsberg is now called Gruyere. This training aid is a knowingly vulnerable application, which can be used to learn and understand web vulnerabilities.
This is a particularly useful experience for test engineers, since one of its main aims is to simulate the activity of penetration testing - simulating attacks from malicious sources (known as Black Hat Hackers, or occasionally Crackers). The process involves an active analysis of the system for potential vulnerabilities resulting from poor or improper system configuration, from known and unknown hardware and software flaws, or from operational weaknesses in process and/or technical countermeasures.

Gruyere is excellent at what it does, in its tutorial-structured, hint-led way. The contents sidebar is the best reference to the classes and detailed subtypes of vulnerability that are covered. Cross-Site Scripting, Script Inclusion and Request Forgery (XSS, XSSI, XSRF) are particularly well served, pardon the pun. Also covered are data tampering, information disclosure, denial of service, remote code execution and elevation of privilege.

For each vulnerability, you assume the role of a black hat (malicious hacker) and uncover the related exploit. One particularly well realised aspect of this Codelab is the way in which you are encouraged - just like a real world security researcher - to combine approaches of both major kinds: black box (e.g. probe it with some bad input data) and white box (e.g. read its source code).

What Can't Gruyere Do?

There are some significant omissions. For starters, any popular C/C++ attack with "overflow" in the name. Gruyere is written in Python, which by design prevents any attempt at reading or writing outwith an array's bounds. For that reason, the popular attack known as buffer overflow is not covered by this lab. Similarly, Python integers are arbitrary-precision, so ordinary integer arithmetic cannot overflow, and Gruyere includes no examples of integer overflow exploits.

It is important to note that these observations do not guarantee the immunity of the website from such attacks. All applications are exposed to so-called platform vulnerabilities. These can be security weaknesses in the web browser or other client side code, or in the underlying Python runtime. The lab doesn't cover such issues, even though sometimes you might avoid platform vulnerabilities by making changes to the app, so as to alter its platform dependencies, or its resource usage.

By far the biggest omission from this lab has to be SQL Injection - for the admittedly very good reason that Gruyere does not use SQL. Again, there are plenty of well researched cases elsewhere on the web. For example, this very readable account by Steve Friedl is a model of clarity.

Microsoft's extensive and very professionally produced Security Virtual Labs are also well worth a look. This one on SQL Injection Vulnerabilities can take up to 90 minutes to complete (the standard allowance for these labs). It requires JavaScript plus "IE6 or above"; and you'll have to install an ActiveX control to connect to and run the lab.

A Note On Scope

While security testing obviously needs to be targeted on a specific set of possible exploits, given the vulnerabilities and their mitigations in the application or system under test, the umbrella undertakings of security training, and more generally education, must entertain no such restrictions.

Take Banned APIs for example. These are closely related to the subject of buffer and integer overflows described above, and to other vulnerabilities specific to code in languages like C/C++ but not found for example in managed code. Why should a Test Department dealing with C# WinForms apps care about these? - Well, it could happen at any time, that the Development Department will decide to slip some C/C++ code, or a legacy library, into the mix. Maybe to avoid injecting a .NET or Windows Shell dependency somewhere. Suddenly, your attack surface quadruples. Shouldn't you be prepared for that? Certainly, to the extent that you should immediately and instinctively be at least aware of the implications of that change.
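To make the managed/unmanaged contrast concrete, here is a minimal C# sketch (the class and method names are hypothetical, invented for illustration). It shows the runtime refusing an out-of-bounds read, but also that C# integer arithmetic wraps silently by default unless a checked context is requested, so managed code is not automatically free of every overflow concern.

```csharp
using System;

public static class ManagedSafetyDemo
{
    // Tries to read one element past the end of an array.
    // The CLR bounds-checks every array access, so this throws rather than
    // reading adjacent memory as unchecked C/C++ code might.
    public static string ReadPastEnd()
    {
        var buffer = new byte[4];
        int index = buffer.Length; // one past the last valid index
        try
        {
            return "read " + buffer[index];
        }
        catch (IndexOutOfRangeException)
        {
            return "IndexOutOfRangeException";
        }
    }

    // Unchecked arithmetic (the usual C# default) wraps around silently.
    public static int AddUnchecked(int a, int b)
    {
        return unchecked(a + b);
    }

    // Checked arithmetic throws OverflowException instead of wrapping.
    public static int AddChecked(int a, int b)
    {
        return checked(a + b);
    }
}
```

Calling AddUnchecked(int.MaxValue, 1) yields int.MinValue, whereas AddChecked with the same arguments throws. Neither protection applies inside a native library called from the managed app, which is exactly the situation described above.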

Another example. Why worry about XSS attacks, when no such vulnerabilities can be exploited, due to the particular nature of our app? - Because tomorrow, the Marketing Department will demand that lucrative third party ads be served up in IFrames beside your primary content. Next day, you could be serving malware as blithely as the BBC. Clearly, just-in-time training is woefully inappropriate here.

Finally, here is the definitive word on the issue from Michael Howard, architect of Microsoft's Security Development Lifecycle, and principal security program manager on their Trustworthy Computing (TwC) Security team. In his article The SDL and the CWE/SANS Top 25 Most Dangerous Programming Errors 2010, he writes:
Even CWE 98, "PHP File Inclusion," is covered by the SDL in our required security training classes, which is especially remarkable when you consider that virtually no PHP code is written at Microsoft!

The reason that we address issues like PHP file inclusion in the SDL is that we don't simply wait for new vulnerability taxonomies to be released and then rush to add mitigations to our security processes; rather, we structure the SDL to provide developers with fundamentally sound, secure programming practices. As a result, we cover not just the known vulnerabilities of today (like the Top 25) but also many of the unknown vulnerabilities that will be discovered tomorrow. The fact that all of the Top 25 are addressed by the SDL is a great validation, but it is the result of the content of our process and not the cause.

Now Hack!

Once you've read the introductory material, including the warning paraphrased in red below, you can start your Gruyere adventure here:
One final thing, do remember to concentrate exclusively on the suggested attacks. Seriously. Any deviation from the vulnerability cases which Google have expressly authorized here could have dire consequences of almost unlimited badness for you.

Part 1 - Overview

Sunday, 20 February 2011

Security Testing Part 1 - Overview

Fundamental Practices for Secure Software Development

Two weeks ago, on February 8, 2011, the Software Assurance Forum for Excellence in Code (SAFECode) published the 2nd edition of their paper Fundamental Practices for Secure Software Development - A Guide to the Most Effective Secure Development Practices in Use Today (2MB PDF). Their stated ambition for the report (original 2008 edition) was to "help others in the industry initiate or improve their own software assurance programs and encourage the industry-wide adoption of what we believe to be the most fundamental secure development methods."

Rather than a comprehensive guide to all possible secure development practices, their concise, actionable and pragmatic report provides a foundational set of these; a set that has been effective in improving software security in real-world implementations by SAFECode members across diverse development environments. They call these “practiced practices”, meaning they are actually employed by SAFECode members, having been identified through an ongoing analysis of members’ individual software security efforts, and are currently in use at leading software companies.

CWE References, Verification, and Resources

Before going into detail about the section dedicated to Testing Recommendations, notice that all subsections are bookended by these three bullets:
  • CWE References: originally created by MITRE Corporation, Common Weakness Enumeration references provide a unified, measurable set of software weaknesses - a universal basis for an extended technical vocabulary, similar in this respect to the utility of software design patterns in development - enabling and encouraging effective discussion, description, selection and use of software security practices. By mapping their recommended practices to CWE, the authors provide a detailed illustration of the security issues these practices aim to resolve, and a precise starting point for interested parties to learn more.
  • Verification: usefully, each subsection includes a list of methods and tools that can be used to verify whether a given practice was applied. This is aimed at checking whether development teams are actually following prescribed security practices!
  • Resources: self explanatory; books, articles, reports, tools, tutorials, in short anything that can usefully be combined with the foregoing report text to expand on it in any way.

And So To Test

For security testing and verification, you'll want to head for page 39, and the section helpfully entitled Testing Recommendations. Here you're reminded more than once, that the goal of testing activities is not to add security by testing, but instead to validate the robustness and secure implementation of a product, reducing the likelihood of security bugs being released and discovered by customers and/or malicious users.

This and other preliminaries dispensed with, there then follow the four subsections unique to security testing and verification recommendations:
  1. Determine Attack Surface. Which is to say, understand the attack surface, with the aid of a good, up-to-date Threat Model, combined with such tools as port scanners, or Microsoft's Attack Surface Analyzer; and your knowledge of all the program's inputs, determined from requirements & design, and supplemented by information about protocols and parsers as supplied by development.
  2. Use Appropriate Testing Tools. Consider which fuzz testing tools, vulnerability scanners, and other resources can be mobilised to uncover programming errors, known vulnerability classes, and administrative issues. Which of these can be automated? What should be the level of exploratory testing, using say network packet analyzers, and network or web proxies that allow man-in-the-middle attacks and data manipulation?
  3. Perform Fuzz / Robustness Testing. This is currently a fast changing area of automated security testing, seeing new research and advancement almost daily. Test departments are identifying software development training requirements, in spite of the growing availability of off-the-shelf fuzz testing tools for standard protocols and general use, because of custom file and network data formats used by the applications under test. Effort needs to be focused on the particular networking protocols or data formats in use, and on the high priority, high exposure entry points that have been identified during the threat modelling stage, as being available to attackers.
  4. Perform Penetration Testing. Which is expensive, and is often partly or wholly outsourced to professional penetration and security assessment vendors. But an in-house penetration test resource or team can maintain a very valuable advantage, from one test to the next, based on the availability of internal product knowledge.
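
As a rough illustration of item 3, a mutation fuzzer for a custom data format can be only a few lines long. The sketch below is an assumption-laden toy, not something taken from the SAFECode report: MiniFuzzer and its methods are hypothetical names, the parser under test is represented by a `target` delegate, and it assumes the parser signals graceful rejection of bad data by throwing FormatException, so that any other exception counts as a finding.

```csharp
using System;
using System.Collections.Generic;

public static class MiniFuzzer
{
    // Returns a copy of the seed with up to `flips` randomly chosen bytes
    // overwritten by random values. The seed itself is left untouched.
    public static byte[] Mutate(byte[] seed, Random rng, int flips)
    {
        var copy = (byte[])seed.Clone();
        for (int i = 0; i < flips && copy.Length > 0; i++)
            copy[rng.Next(copy.Length)] = (byte)rng.Next(256);
        return copy;
    }

    // Feeds `iterations` mutated inputs to the target and returns every input
    // that made it throw something other than the expected FormatException.
    public static List<byte[]> Run(byte[] seed, Action<byte[]> target, int iterations, int rngSeed)
    {
        var rng = new Random(rngSeed); // fixed seed so a failing run can be replayed
        var failures = new List<byte[]>();
        for (int i = 0; i < iterations; i++)
        {
            var input = Mutate(seed, rng, 1 + rng.Next(3));
            try { target(input); }
            catch (FormatException) { /* graceful rejection: not a finding */ }
            catch (Exception) { failures.Add(input); }
        }
        return failures;
    }
}
```

Seeding the random generator explicitly is the one deliberate design choice here: a crash is only useful to development if the input that caused it can be reproduced.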

A Sample Agenda

These five pages 39-43 of the SAFECode report supply us with most of the headings we need to form a starting agenda for an introduction to security testing.

  • Confidentiality, Integrity, Availability (CIA)
  • Threat Modelling
  • Attack Surface
  • Inputs, Protocols and Parsers
  • Fuzz / Robustness Testing
  • Vulnerability Classes (SQL Injection, XSS)
  • S.T.R.I.D.E.
  • Common Weakness Enumeration (CWE)
  • Vulnerability Analyzers
  • Network / Web Proxies
  • Port Scanners
  • Packet Analyzers

Just to reiterate (and to paraphrase one of the report's authors, the SDL's Michael Howard), this paper's unique importance is its description of what SAFECode members are doing in practice, to raise the security bar. It is deeply pragmatic, not a theoretical or academic document. SAFECode is also actively seeking public comment on this paper, especially in the verification sections. So if you know of specific tools or techniques to help determine if a software development team is adhering to the practices, please let them know.

Wednesday, 16 February 2011

Auntie Beeb's Virus

BBC Music Websites Are Infectious

Did your antivirus software detect the BBC 6 Music / 1Xtra driveby?

According to this Virustotal scan, currently only 12 of the top 43 antivirus products correctly identify Tuesday's malware threat, which at the time of writing, is still actively serving up malicious executables from IFrame tags on these popular BBC streaming sites. In cases like these, the simple act of visiting a website is sufficient to cause infection.

Kaspersky did detect this threat, which is good news for us, both at work and at home. On the other hand we are far from complacent, noting among the failures such high profile names as AVG, BitDefender, McAfee (all editions), Microsoft and Sophos: all companies whose products we have used, endorsed, and recommended to our customers and families, at one time or another. Today, I can't bring myself to link to them... nor obviously to those BBC music websites! Update (Feb 17): all of the above have now caught up, and the latest Virustotal figure is 23/43.

Here is the Websense Security Labs blog entry on the attack, which identifies the malware as having been authored using the still popular PEK toolset (Phoenix Exploit Kit, 2007).

Tuesday, 15 February 2011

Regex Tennis

Game History Validation

This example from the Universe of Regular Expressions illustrates a substantial, real world application of the "AB alphabet" type, famous from countless introductions and tutorials. But this example didn't arise at work. It came up while we were relaxing on holiday one summer, and watching Wimbledon. Suddenly I started sketching state diagrams on the backs of Embo postcards...

Suppose during a tennis match we want to record more than just game and set scores; we'd like to record the detailed sequence of points won and lost in each game. So for example, if player A won a "love game" (in which player B failed to score at all), we might record the four points that she won in this format:
AAAA
If instead she lost just one point out of five, but went on to win, then the game will be represented by one of these:
BAAAA  ABAAA  AABAA  AAABA
depending upon whether she lost the first, second, third or fourth point (she can't have lost the final fifth point, since she did win the game). Similarly if she lost two points, well, those could be any two out of the first five points. Applying binomial coefficients, "five choose two" = ten possibilities, and the game's history will be one of these:
BBAAAA  BABAAA  BAABAA  BAAABA  ABBAAA  ABABAA  ABAABA  AABBAA  AABABA  AAABBA
Now suppose we wish to validate such game histories. Can we use a Regex to determine whether a given sequence of As and Bs represents a legal game of tennis? Well, the scoring in any sport can be represented by a finite state machine, so yes, a Regex can certainly validate a game of tennis. But before proceeding, it's worth mentioning that Regex has no built-in support for permutations. That sometimes makes it the wrong tool for jobs like this one, which may have rather long solutions as a consequence. Solution sizes are exponential in the alphabet size, to be exact. Our alphabet has just two letters, so we'll persevere for now.

The examples given so far represent full, legal games. Illegal examples include such things as A, AA, AAA (these games are still in progress), or AAAAB (player A has already won the game before B "wins" that impossible final fifth point). In fact, when taken together with their opposites, i.e. the corresponding cases where B wins instead of A, the foregoing 15 cases already exhaust all 30 possibilities for what we'll call a short game (up to six points). So, by disjoining ( | ) all of the foregoing examples and their opposites, then forcing a full game match by delimiting the result with ^ and $, we can obtain a canonical pattern capable of validating all short games:
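The canonical pattern is long but entirely mechanical, so rather than type it out, here is a sketch that generates it. The ShortGames class and its method names are my own hypothetical scaffolding: it enumerates every string of four to six points, keeps those a simple reference scorer accepts as complete games that never reach Deuce, and joins them with |.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ShortGames
{
    // True when the history is a complete game that never reaches Deuce.
    public static bool IsShortGame(string history)
    {
        int a = 0, b = 0;
        foreach (char point in history)
        {
            if (a == 4 || b == 4) return false;  // point played after the game was won
            if (point == 'A') a++;
            else if (point == 'B') b++;
            else return false;                   // not in our two-letter alphabet
            if (a == 3 && b == 3) return false;  // Deuce: this is a long game
        }
        return a == 4 || b == 4;
    }

    // All strings over {A, B} of exactly the given length.
    public static IEnumerable<string> Words(int length)
    {
        if (length == 0) return new[] { "" };
        return Words(length - 1).SelectMany(w => new[] { w + "A", w + "B" });
    }

    // Every legal short game, four to six points long.
    public static List<string> All()
    {
        return Enumerable.Range(4, 3)
                         .SelectMany(n => Words(n))
                         .Where(w => IsShortGame(w))
                         .ToList();
    }

    // The disjunction of all short games, anchored at both ends.
    // Note: in .NET, $ also matches just before a trailing newline.
    public static string CanonicalPattern()
    {
        return "^(?:" + string.Join("|", All()) + ")$";
    }
}
```

ShortGames.All() returns the 30 histories counted above, and Regex.IsMatch("AAAAB", ShortGames.CanonicalPattern()) is false, as it should be.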
But we can also do a lot better. Here is a rough sketch of the state transition diagram for an arbitrary game. For simplicity, tiebreak games are excluded from this treatment, but aside from the game length, their analysis is essentially the same. Scoring on the diagram proceeds from left to right, except when forced back (arrows) from Adv_A or Adv_B to Deuce. The transition is upward whenever A wins the point, downward when B.

Our so-called short games correspond to all the valid paths through this diagram, from Start to Win_A or Win_B, and avoiding Deuce. It so happens that the analysis used here is better explained in a later part of this problem, so for now I'll just pull out of my hat this improved pattern, which matches all 30 (and only those) short games:
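One factored pattern with this property is sketched below. It is my own reconstruction using the split-in-half idea, not necessarily the pattern originally shown here: the first three points are grouped by how many each player won, and each group is followed by the only endings that finish the game without reaching Deuce. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShortGamePattern
{
    // Four branches, keyed on the first three points:
    //   AAA then at most two Bs and the winning A (and the mirror image for B);
    //   two As and a B, then either A's two winning endings or B's BBB;
    //   two Bs and an A, then either B's endings or A's AAA.
    public const string Pattern =
        "^(?:AAAB{0,2}A" +
        "|BBBA{0,2}B" +
        "|(?:AAB|ABA|BAA)(?:AA|ABA|BAA|BBB)" +
        "|(?:ABB|BAB|BBA)(?:BB|BAB|ABB|AAA))$";

    public static bool IsShortGame(string history)
    {
        return Regex.IsMatch(history, Pattern);
    }

    // Counts how many strings over {A, B}, up to maxLength long, match.
    public static int CountMatches(int maxLength)
    {
        int count = 0;
        for (int length = 0; length <= maxLength; length++)
            for (int bits = 0; bits < (1 << length); bits++)
            {
                var word = new char[length];
                for (int i = 0; i < length; i++)
                    word[i] = ((bits >> i) & 1) == 1 ? 'A' : 'B';
                if (IsShortGame(new string(word))) count++;
            }
        return count;
    }
}
```

The branches contribute 3 + 3 + 12 + 12 alternatives, and CountMatches(8) confirms that exactly 30 strings match.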
Next we address the remaining long games. After six equally shared points, the score is "40-40", aka "Deuce". For either player to win the game from this point, she must score a further two consecutive points, making it therefore a game of 8 (or 10, or 12, ...) points in total. Now, applying binomials again, there are "six choose three" = twenty ways to reach this intermediate Deuce state, beginning at Start. This portion of the game history comprises 3 As and 3 Bs, intermixed in any one of these 20 possible sequences.

Divide and Conquer!

Cut the play in half. If there are 3 As in the first half, then there must be 3 Bs in the second:
AAABBB
Or if there are 2 As and a B in the first half, then the second must comprise some combination of one more A and 2 Bs:
(AAB|ABA|BAA)(ABB|BAB|BBA)
And so on. There are only four such partitions, and when we gather them together in a disjunction, we obtain the pattern matching any sequence of play from Start to Deuce:
(AAABBB|(AAB|ABA|BAA)(ABB|BAB|BBA)|(ABB|BAB|BBA)(AAB|ABA|BAA)|BBBAAA)
The long game pattern is completed by tacking on to this stem the playoff stage, in which the winner concludes with two consecutive points. This is simply any number (possibly zero) of AB or BA pairs, followed by a final AA or BB. Converting these words into symbols:
(AB|BA)*(AA|BB)
To obtain the final Regex pattern of a tennis game, take the disjunction of the patterns above for matching short and long games:
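Assembled in code, the whole thing might look like the sketch below. The stem and playoff follow the description above directly; the short-game part is one possible factoring (an assumption on my part, since any pattern matching exactly the 30 short games would do), and the TennisRegex name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TennisRegex
{
    // The 30 short games, grouped by the first three points played.
    private const string Short =
        "AAAB{0,2}A|BBBA{0,2}B" +
        "|(?:AAB|ABA|BAA)(?:AA|ABA|BAA|BBB)" +
        "|(?:ABB|BAB|BBA)(?:BB|BAB|ABB|AAA)";

    // The 20 ways of sharing the first six points, three each.
    private const string ToDeuce =
        "(?:AAABBB|(?:AAB|ABA|BAA)(?:ABB|BAB|BBA)" +
        "|(?:ABB|BAB|BBA)(?:AAB|ABA|BAA)|BBBAAA)";

    // Any number of shared pairs, then two consecutive points to one player.
    private const string Playoff = "(?:AB|BA)*(?:AA|BB)";

    // A legal game is either a short game, or a stem followed by a playoff.
    public static readonly Regex Game =
        new Regex("^(?:" + Short + "|" + ToDeuce + Playoff + ")$");

    public static bool IsLegalGame(string history)
    {
        return Game.IsMatch(history);
    }
}
```

For example, IsLegalGame("ABABABAA") is true (Deuce, then two straight points to A), while IsLegalGame("ABABABA") is false because the game is still in progress at Advantage.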

An Alternative Approach

The above is a complete solution, and despite the pattern lengths involved, still a practical one, since these patterns or equivalents can easily be autogenerated. However, the autogeneration process is exactly equivalent to walking through the state diagram. If we can do that, then we already have a finite state machine capable of game history validation.

Class TennisGame below is one example of such a state machine, encoding the scoring rules of tennis games. It has a Play method which accepts one string parameter purporting to be a game history, and returns the validation result as a boolean. Any two adjacent characters can be used for the alphabet, so for example, a game history can be written as ABAAA, or equivalently, as "10111".
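public enum Point
{
    Love, Fifteen, Thirty, Forty, Advantage, Win
}

public class TennisGame
{
    public bool Legal { get; private set; }
    public Point[] Score { get; private set; }

    public TennisGame()
    {
        Score = new Point[2];
        Reset();
    }

    public void Reset()
    {
        Score[0] = Score[1] = Point.Love;
        Legal = true;
    }

    // Replays a game history; returns true only for a complete, legal game.
    public bool Play(string points)
    {
        Reset(); // start from Love-all so that an instance can be reused
        foreach (var point in points)
            WinPoint(point & 1); // the low bit of the character selects the player
        return Legal && GameOver;
    }

    private void WinPoint(int player)
    {
        if (GameOver)
        {
            Legal = false; // a point was played after the game had been won
            return;
        }
        switch (Score[player])
        {
            case Point.Love:
            case Point.Fifteen:
            case Point.Thirty:
                Score[player]++; // Love -> Fifteen -> Thirty -> Forty
                break;
            case Point.Forty:
                switch (Score[1 - player])
                {
                    case Point.Love:
                    case Point.Fifteen:
                    case Point.Thirty:
                        Score[player] = Point.Win;
                        break;
                    case Point.Forty:
                        Score[player] = Point.Advantage;
                        break;
                    case Point.Advantage:
                        Score[1 - player] = Point.Forty; // back to Deuce
                        break;
                    default:
                        Legal = false;
                        break;
                }
                break;
            case Point.Advantage:
                Score[player] = Point.Win;
                break;
            default:
                Legal = false;
                break;
        }
    }

    public bool GameOver
    {
        get { return Score[0] == Point.Win || Score[1] == Point.Win; }
    }
}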
These two approaches are equivalent in the infinite set of game histories they'll validate, and broadly similar in runtime behaviour. That's because .NET does not work from the pattern text directly: it first translates a regular expression into an internal program for its backtracking matcher (interpreted by default, or compiled to MSIL when the RegexOptions.Compiled flag is used), and for patterns like these that program steps through the input much as the state machine does. It is left as an exercise for the student to extend both of these approaches to cover whole sets, matches and tournaments, with and without tiebreakers!

Sunday, 13 February 2011

Emotronic Happy Hardcore

They Come From Art Schools

Tightly knit groups of new, high talent. Sometimes they are called Girl School (no, not that one). They fuse abundant skills, folding into their stage acts divers abilities from the performing arts and others. Their musical composition, arrangements, pathos and humour emerge with effortless high quality, recombine and transcend pop culture. They mesmerise, and you swoon. You tell everyone you know, everyone you meet about them. You begin stalking them - once an essay in frustration, now an easy vice, enabled by this age of universal omniscience - you collect every copy of every scrap of every note they play, write, think.

But their incorporation is a callous, uncaring, opportunistic arrangement of convenience. One that will tear out your heart, when their project serves its only purpose, when their vehicle advances their individual creative, artistic, and social growth and development. Tear it out and stake it, rip in half and quarter, turn to dry brown dust. You will rage at this world's injustice, to allow so monstrous a disbanding. Damn you Girl School, you completely ruined my whole life when I was seventeen. And now it's happening again.

Futuristic Retro Champions


We accidentally caught the best Art School Band since Roxy Music (my description) in September 2009, when we dropped in to King Tut's Wah Wah Hut to see Charlotte Hatherley. The second support act Futuristic Retro Champions, we later learned, emerged from Edinburgh College of Art in 2006, playing their debut gig in that city's famous Wee Red Bar. What they do couldn't be spelt out more clearly in their name. Five years and uncountable plaudits later, they are from left to right:

Cecilia "Ceal" Stamp - bass, vocals.
Harry Weeks - guitar, vocals, synths, production.
Carla Easton - keys, vocals, occasional saxophone.
Sita Pieraccini - vocals and melodica, occasional tambourine.
This lot are what happy hardcore should have been [...] this is one of the most exciting Scottish bands I have heard in ages.
- Gavin Cumine, Broken English

This band never fail to fill our hearts with colourful sparks of joy. It’s sugary, happy hardcore twee-pop, and all defiant with it.
- The Skinny

With a pure pop sound resplendent with hook-laden choruses they are an engaging and energetic ball of fun, and their whirlwind live show has attracted heaps of praise from press and musical peers alike.
- Gary Flockhart, The Scotsman

There are many more raving acclamations where those came from, but what's the point. The appalling truth is this: despite having upcoming gigs in Glasgow and Edinburgh this April, timed to coincide with their debut 2CD release "Love And Lemonade", Futuristic Retro Champions have already split. Postponed from last December when Ceal shattered her elbow, these are their farewell dates; and the new CD, well that's their final Retrospective. It's already too late to tell you how witty and catchy and varied and original are their songs, how soaringly and technically and sometimes achingly beautiful are their speciality harmonies. If you don't catch their penultimate April 8 show at Glasgow's Captain's Rest, or the final April 9 at Edinburgh's Wee Red Bar where it all began, then your whole life too will be completely ruined, for nothing now can ever come to any good.

Look I Made You A Mixtape (Like A Mathematician)

I called it Bootleg, it's on the Verbatim label, it's all one band. Sorry I can't send it on; the British government would have me arrested and my house severed from the Internet, causing me to die for sharing with you, the music that I love. Also, I want my favourite artists always to be fully compensated for their work. But look, you can get all the tracks from the links below. Some of them you'll have to pay a few pennies for, many others are free. Then to pull it all together from this Scotch Broth of MP3s, FLACs and YouTube vids (and it has to be a CD, because: well, retrochamps, right?) you'll need a few audio tools.
  1. Speak To Me (3:50)
  2. Epic New Song (4:21)
  3. Pulling Box Shapes (2:47)
  4. Isn't It Lovely (4:11)
  5. You Make My Heart (3:37)
  6. DIY Lovesong (2:46)
  7. Let's Make Out (2:32)
  8. Told Ya (4:07)
  9. Jenna (4:08)
  10. Kitten With A Loaded Gun (3:04)
  11. Strawberries And Vodka Shots (2:47)
  12. Told Ya (TYGH Remix) (3:59)
  13. May The Forth (3:36)
  14. Settle Down (4:01)
  15. Robert De Niro's Waiting (free) (4:10)
  16. May The Forth (Miaoux Miaoux) - Glasgow PodcArt link (3:57)
  17. DIY Lovesong (live) (2:51)
  18. Jenna (Live at the Mill) (4:09)
  19. Nintendo (YouTube) (3:16)
  20. Strawberries And Vodka Shots (original demo) (2:46)
  21. Uh Oh (No Show) - Ceal's lead vocal debut! (2:30)
Tracks 1-4 are from the Lollipoptastic EP.
Tracks 5-8 are from the FRC EP.
Tracks 9-12 are from the LaChunky EP (FREE).
Tracks 13-15 are from the May The Forth / Settle Down single.

That should keep you going, something to play in the car for the next couple of months, until the 25-track official, retrospective release arrives. The Lollipoptastic EP appears to have disappeared almost completely from the web. I hope the track ordering on April's 2CD release will be a wee bit like mine. I hope the full Lollipoptastic EP is included. I hope there's some new (previously unreleased) demo material on it too, like maybe the early 'Hi!' and 'Lullaby'.

I hope...

Hey, I hope Harry gets his solo project running, and the girls start the new band that we've been promised, but soon!

Still... yes, of course, they must escape their homunculi. The real enemy of this piece is neither the hapless art school student, nor society, nor ambition, nor ambivalence. It is time. It is change itself. These young adults, without exception, already have healthy careers right now outwith the FRC; and yet still more promising future prospects. But still. Wouldn't that be something to see, their mooted reunion, 10 years hence?

Update (Feb 26): being of the firm opinion that such a band deserves one, I created a Futuristic Retro Champions Wikipedia page. And after an initial request for speedy deletion, summarily dismissed, I'm pleased to say my fellow Wikipedians appear to agree.

Tuesday, 8 February 2011

Electricity - Part 1

Primordial Communications

I was eight years old when my gran bought me the Meccano Elektrikit. After building all the motors, buzzers, bells, and other projects in the manual (including the fantastic, fully functional Telegraph Receiver with Bell and Morse Key, and the brilliant Electric Shock Machine!), I began looking for still more practical applications. I was almost eleven by the time I'd decided to drill a hole through my bedroom floor into the downstairs kitchen ceiling, and push through a pair of wires. These then became a simple serial circuit, comprising a battery and small lamp upstairs, and a switch downstairs.

Now at last I could play Mungo Jerry's In The Summertime, or Hotlegs' Neanderthal Man, at a decent volume on the Dansette. And when dinner was ready, mum - rather than having to come upstairs and bang on the bedroom door - could use her kitchen switch to let me know. Provided, of course, I just happened to be looking directly at the lamp at that precise moment (it wasn't much bigger or brighter than a single fairy light).

More often I'd be freaking out, kneeling eyes closed with a 12½" Meccano girder in each hand, thrashing the bed; the frustrated drummer, performing for his myriad adoring fans. Poor mum would patiently tap her switch until eventually the song ended, and I would notice the signal. Clearly this situation necessitated the introduction of a second channel of communication, pointing in the opposite direction, to let her know when I'd got the message.

The Commons

Before pushing through another pair of wires to support this "back channel", I sketched out the full design - wires, switches, batteries and bulbs - on to my first circuit diagram. On inspection it struck me that here were two copies of the original circuit, behaving completely independently of each other; they should still work correctly if electrically joined at any single point. And, should that point happen to be one of the wires travelling through the floorboards, why then I could physically remove one of those four wires, and get by with just three.

Wait just another minute though. Suppose that common point was the '-' terminal of each of my 4½V batteries. Call that point "zero Volts". Then the '+' terminals of the batteries would both be at the same 4½V level above this. In other words, there's no voltage difference between them; were I also to connect the '+' terminals of my batteries together, no current would flow across the join. But adding this wire meant that I'd have just one battery powering the whole system, instead of two. It's fun discovering things like that for yourself!

A string of enhancements soon followed. Two more series bulbs were added, guarding against filament failure and reassuring the sender that a message was in fact being transmitted. When I discovered diodes, the battery was replaced by the 13V AC output of my Hornby model railway transformer, and the number of connecting wires reduced to two. And then to one, when I co-opted the household mains Earth wire for the return path: yes, I guess I was a preteen criminal.

Next time: the unique sound-to-light system of Blak Ice Disco.

Dansette image courtesy of / © The British Library Board.

Simple Regex #2: Validation

What's a Valid XML Element Name?

Sometimes we want to convert customer entered data to XML. Sometimes we want to use it for an element name. Obviously it'll need some sanitising, so what should we escape? The XML specification (a W3C Recommendation) is a wee bit twisty on this question, its section on Start Tags defining a Name roughly, i.e. ignoring so-called combining characters and extenders, like this:
NameStartChar ::= Letter | ‘_’ | ‘:’
NameChar ::= NameStartChar | Digit | ‘.’ | ‘-’
Name ::= NameStartChar (NameChar)*
That's similar to the definition of an identifier in many languages, but with the addition of a few specific punctuation marks. The twist comes when you consider namespaces. The XML Names recommendation states that these assign a meaning to names containing colon characters, and that therefore, authors should not use the colon in XML names except for namespace purposes. Even though XML processors must still accept the colon as a valid name character, as per the above syntax, it gives off the odour of a practice to avoid. So we go with this:
NameStartChar ::= Letter | ‘_’
NameChar ::= NameStartChar | Digit | ‘.’ | ‘-’
Name ::= NameStartChar (NameChar)*

No Colons Then?

That's right. Our element names start with a letter or underscore, then continue with any number of these, possibly in combination with digits, periods, and hyphens. To put it another way (inexactly, but in practice acceptably, for my purpose): an element name is any nonempty sequence of word characters (letters, numbers, underscores), periods, and hyphens; and it must start with either a letter or an underscore.

In the interests of localization, rather than the parochial a-zA-Z_0-9, we should use the Regex word character class \w to represent, erm, word characters. That just leaves the period and hyphen to be mopped up in the main sequence. Similarly, when it comes to specifying the initial letter, rather than a-zA-Z, we should use the letter class \p{L} built for just this purpose:
// Requires: using System.Text; using System.Text.RegularExpressions;
private static string ToElementName(string input)
{
  // Replace all non-hyphen/period/word characters with underscores.
  var result = new StringBuilder(Regex.Replace(input, @"[^-.\w]", "_"));
  // If input doesn't start with a letter or underscore, prepend an underscore.
  if (!Regex.IsMatch(input, @"^[\p{L}_]"))
    result.Insert(0, '_');
  // Done.
  return result.ToString();
}
A point to note about the first pattern [^-.\w] is that neither the hyphen nor the period need be escaped. Within brackets, the period represents itself, rather than being a wildcard; and the hyphen is similarly literal (as opposed to indicating a range) when it appears as the first item in a set, here immediately after the negating ^.
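
A quick way to gain some confidence in the results is to hand them to the framework's own name checker. The sketch below repeats the method inside a hypothetical ElementNames class and adds an IsValidName helper (also my own naming) built on XmlConvert.VerifyNCName, which throws if its argument is not a valid colon-free XML name. Treat it as a spot check on a few inputs rather than a proof, since \w and the XML grammar are not guaranteed to agree on every Unicode character.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class ElementNames
{
    public static string ToElementName(string input)
    {
        // Replace all non-hyphen/period/word characters with underscores.
        var result = new StringBuilder(Regex.Replace(input, @"[^-.\w]", "_"));
        // If input doesn't start with a letter or underscore, prepend an underscore.
        if (!Regex.IsMatch(input, @"^[\p{L}_]"))
            result.Insert(0, '_');
        return result.ToString();
    }

    // True when the framework accepts the name as a colon-free XML name.
    public static bool IsValidName(string name)
    {
        try
        {
            XmlConvert.VerifyNCName(name);
            return true;
        }
        catch (XmlException) { return false; }      // contains an invalid character
        catch (ArgumentException) { return false; } // null or empty name
    }
}
```

With this in place, "order total" becomes "order_total" and "2nd-item" becomes "_2nd-item"; both results pass IsValidName, whereas the raw "2nd-item" does not.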

Other Useful Character Classes

Why yes, there are some others, I'm glad you asked. These two are probably the droids you're looking for: \p{Lu} for uppercase letters, and \p{Ll} for their lowercase comrades. For the full story about Character Classes in C#, go to

Tuesday, 1 February 2011

Tweets - January 2011