Thursday, 17 June 2010

Common Table Expressions

Waiter, there's a Table on the Fly

Common Table Expressions (CTEs) have been around for a while, since having been introduced into SQL Server 2005. Basically they provide a way to create a temporary working table, stuffed with useful calculated data, which can then be accessed from an adjoined query just as though it were a real physical table in the database.

Now, the particular power of Microsoft CTEs comes from the fact that they can be built recursively. This fact has made them a very popular choice for interrogating hierarchical, tree-structured data, for example in tables using the ID : ParentID method of auto foreign keys. A few lines, a mere couplet of SQL, can conjure up a temp table containing any given row together with all of its descendants.

The question of how best to implement storage of hierarchical, tree-structured data is an ancient one, and there have of course been many solutions to it. Joe Celko's SQL For Smarties describes in detail an alternative, fascinatingly mathematical approach with some important performance benefits. I have used this on some occasions, recursive CTEs on others, and still other alternative approaches as appropriate when non-functional requirements have dictated.

Lighten Up

But I've probably been a bit blinded by the efficacy of CTEs in this particular context, to their other potential uses in simpler scenarios. This occurred to me today, when a colleague asked for some advice in building certain dashboard metrics as SQL queries. After a little too much black coffee, I found that while part of my brain idled along a path of conventional JOINs and nested SELECTs, another part instead began to wonder how a procedurally oriented developer, habitually reluctant to think in relational terms, might be able to build such a compound query out of its perceived components.

Set-relational operations are combined in a fundamentally different way from sequential ones. This has ever been clear to the SQL school, but is only now, with the advent of LINQ, beginning to receive much more attention from the procedural. It seemed to me at that moment, that CTEs might have a role to play in this ongoing adoption.

Quite often, people might correctly identify and perform a promising initial SELECT query, projecting their data from source tables into a "shape" somewhat closer to their desired output; but then they immediately run up against SQL's unfamiliar syntax and context restrictions, as soon as they try to take step 2. This appears to be an exact analogue of the notorious "impedance mismatch" between the coding requirements of relational storage, and those of the business object environment.

But what if we simply take that intermediate projection, construct a CTE out of it, and use that as our starting point for the next step?

Composable Queries Again

In practice, there are no significant limitations upon what can be done with the CTE contents, at least as far as our procedural developer is concerned. It really can be treated as just another table, containing data structured in a slightly more goal-friendly way than was previously available. Now just lather, rinse and repeat.

The forementioned colleague's problem was: obtain a result set containing the percentages of cases successfully closed, after a given number of communications, where the communications dimension was striated into "less than two", "two or three", and "four or more".

From the database, we use only an intermediate Communication table, which connects cases to communications (so each row has an integral Comm_CaseId and Comm_CommunicationId). Here's how CTEs allow the answer to be built up piecemeal. The first CTE, "Totals", simply obtains a count of all the cases, to allow us later to calculate proportions. The second, "Comms", aggregates the cases as required by our striation, delivering each case ID in one column, together with a corresponding category letter in another. The third CTE, "Cats", then further aggregates these categories. Finally, our main SELECT combines this output with the previously computed totals to arrive at the required output:

WITH Totals(Total) AS
(
SELECT COUNT(DISTINCT Comms.Comm_CaseId)
FROM Communication Comms
),
Comms(CaseID, Cat) AS
(
SELECT
Comms.Comm_CaseId,
CASE COUNT(Comms.Comm_CommunicationId)
WHEN 0 THEN 'A' -- Category A means 0 or 1 communication(s).
WHEN 1 THEN 'A'
WHEN 2 THEN 'B' -- Category B means 2 or 3 communication(s).
WHEN 3 THEN 'B'
ELSE 'C' -- Category C means 4 or more communication(s).
END
FROM
Communication Comms
WHERE Comms.Comm_CaseId IS NOT NULL
GROUP BY Comms.Comm_CaseId
),
Cats(Num, Cat) AS
(
SELECT COUNT(CaseID), Cat
FROM Comms
GROUP BY Cat
)
SELECT
Cat AS 'Category',
100 * Num / (Total + 0.0) AS '%'
FROM Cats, Totals


Non Hollywood Ending :-(

In the end, my colleague reverted to traditional Join based ANSI SQL code for her problem, as that turned out to be easier to decorate with the auxiliary columns she needed for the purposes of her BI analysis. Still, it was an enlightening exercise.

No comments:

Post a Comment