A recent update to our flagship product takes advantage of the server-side paging capabilities of the DevExpress Grid control. Naturally, this change has involved the migration of much client-side C# code into server-side table changes, triggers, and stored procedures, all written in SQL. One of the casualties was a particularly hirsute C# method which used to sort table contents into what's sometimes called "Natural Order", so that e.g. DC10 would come after DC9, rather than the naive collation or ASCII order, which would have them reversed.
For reasons unknown it fell to me to implement our Natural Sort. Easy I thought, I'll ask Google for the answer. That's when I discovered there's really no such thing as The Natutral Sort Order. It depends on your particular data, and your specific requirements, to a frankly surprising degree. Of course, there is plenty of material to choose from on the web. Jeff Atwood's Coding Horror on the subject is quite a good central station for your exploration of the matter. But this plenitude is also a bit of a problem. After an hour or two of research, I'd decided my best plan was to design and implement my own, newly reinvented wheel.
Fields
The basic approach is to partition or "stripe" the input (unsorted) data into a sequence of fields, alternating alphabetical content with numerical, and then treat each field as its own Sort By column - sorting the alpha fields as normal, and the numeric fields numerically. In the above DC9 / DC10 example, this results in an initial alpha field containing "DC" in both cases, followed by a numeric field containing the integers 9 and 10, which are then subjected to a numerical sort.
Some of the examples I'd read performed this latter sort by actually converting the input data field into an integer, then using the language's numeric comparison capabilities. I didn't want to use that approach, because a 32-bit signed integer can only be used for field sizes up to 9, a 64-bit one 18, and so on. I had no specification to assure me customer data wouldn't exceed such an arbitrary bound, so I fell back on the old workaround of keeping the numeric fields in string form, but left-padding them with zeros until all the rows were the same length, allowing an alpha sort to do the job of a numeric one. This is essentially part of what we do when we adopt the ANSI date format, e.g. 2013-09-29, to get correctly ranked dates.
Collations
Notice that it doesn't matter which padding character you use to right-align the numeric fields, just as long as it doesn't come after 0 (zero) in the collation order. This is important later.
One fun fact I found while researching collations was that aa comes after ab in the Danish / Norwegian collation order. For historical-typographical reasons, aa is treated (while sorting) as identical to the letter å, which just happens to be the last letter of their shared alphabet. Never mind; we'll just have to assume anyone using a particular collation order knows what they're doing, and won't be surprised by results which after all should in all cases appear perfectly non-anomalous to them.
Field Sizes
Okay, so now we have this requirement to right-justify our numeric fields by left-padding them with e.g. zeros. What fixed field size should we pad them out to? Well our input data, being stored in a traditional, relational database, has some particular maximum string length. In my case that length was 200. There's nothing to stop our customers filling every single character space with a decimal digit. So we could be looking at a numeric field of width 200.
What about the alpha fields? These don't require padding, since standard left-to-right string comparisons work fine for them. But note that every alpha character "uses up" one position in the input data, rendering that location unavailable for storing a subsequent digit. Long story short, we can stuff any alpha prefix into the left edge of our 200-character field, and still have enough room to fit and pad out the remaining, right-justified, numeric content.
For performance reasons, the ultimate destination for this calculation was a "live" Sort Order column in the relevant database table, rather than a UDF call at SQL SELECT time. That's why my buffer size had to be allocated statically, rather than optimised with reference to the worst case requirements of the data actually in the table; we didn't want a new row of data invalidating the precomputed sort orders of the old data.
Islands and Seas
You might worry about non-digit characters encroaching on the padding space of our numeric fields, and you'd be right to. Actually everything works okay as long as we stick to letters and digits. Anomalies can start to appear when we introduce punctuation characters, especially when using ASCII collation. The numeric digits and the upper- and lower-case letters can be seen to form three disconnected "islands" of codes, surrounded by four seas of punctuation characters.
In practice, these anomalies are mitigated by our customers' sparse use of such punctuation, and tendency to apply it consistently whenever it is used at all. As a further mitigation, I changed the padding character from a zero digit to a space, ensuring that padded-out numeric fields are essentially guaranteed to sort lower than any alpha or other character found in the same region.
Example
The following, correctly sorted data can be seen to illustrate these adjustments. Notice, in the Natural Sort column, the use of the space character as filler, and the fact that the 'h' of 'Coach' occupies the same character column position as the '1' of 'Van 1234' without causing any problem:
Simple Sort Natural Sort Coach 12 Coach 2 Coach 2 Coach 12 Van 1234 Van 234 Van 234 Van 1234
Field Count
Obviously the input data might contain further alternating fields of non-numeric and numeric data. What should we do with subsequent fields? Well, we just package them up in pairs using exactly the same algorithm as the first, and when we have done this enough times to consume all of the input, return the concatenation of all these field blocks as the sortable key.
There is another minor optimisation available here. Obviously the first field pair must have consumed at least one character from the input. This means that the field "buffer" for the second pair can be made one character shorter than the first - say, 199 characters instead of 200. Likewise, if input characters remain after the second or subsequent field pair have been extracted, then that pair must have consumed at least two characters, so the next buffer size can be 197, or 195, or 193, or...
Yes, quite. The law of diminishing returns cuts in quite promptly here, especially since we decided that a total of 3 field pairs would be more than adequate for our customers' requirements (actually there is an auto-generation element in our product, designed to nudge them into using just a single field pair: an alpha prefix, followed by a numerical index). So I just left all my buffers at width 200. You should obviously make use of this optimisation if your input size limit is much lower than 200, or if you decide to use a lot more significant field pairs.
Coding and Testing
This works adequately and was ultimately accepted for production, but I must acknowledge here the excellent work done by our Test Department, both in testing my new UDF out-of-sequence when I asked - quite unreasonably - for an early comparison with the old client-side C# function (come to think of it, how the hell did they even do that?), and also in uncovering pretty quickly all of the corner cases - punctuation, padding characters - mentioned above.
Oh, the code? Yeah sure, here ya go. Should be quite easy to hunt down the few rogue "200"s to adapt it for your use. You might also want to limit the number of iterations to prevent DOS attacks (we use only 3). As I said at the outset, it's still surprisingly specific and might not work for you. For example, it doesn't recognise decimal points / numeric separators. The truth is, there simply does not exist one Natural Sort Algorithm suitable for all data.
CREATE FUNCTION [dbo].[NaturalSort](@input NVARCHAR(200)) RETURNS NVARCHAR(MAX) AS BEGIN DECLARE @count INT = LEN(@input), @result NVARCHAR(MAX) = '', @p INT = 1, @q INT, @x INT, @y INT WHILE @p <= @count BEGIN SELECT @x = PATINDEX('%[0-9]%', SUBSTRING(@input, @p, @count) + '0') - 1 SELECT @q = @p + @x SELECT @y = PATINDEX('%[^0-9]%', SUBSTRING(@input, @q, @count) + '!') - 1 SELECT @result = @result + SUBSTRING(@input, @p, @x) + REPLICATE(' ', 200 - @x - @y) + SUBSTRING(@input, @q, @y) SELECT @p = @q + @y END RETURN @result END
No comments:
Post a Comment