I’ve been working on contract for a month or so now, helping to speed up some back end database summarization activity that had gotten slow enough that it was threatening to bleed into the next day’s time frame. Yuck!
Mostly standard stuff, tweaking indexes, ditching cursors, etc.
But one problem had me scratching my head today.
Essentially, I had a table of Clients, each one of which could be linked to any number of “Exposure” records, each of those having a start and stop date of exposure.
The trick was to determine how many total years of exposure a client had.
The thing is, each client might have multiple exposure records with overlapping (or not) time frames. So essentially, the problem boiled down to collapsing all the exposures to a single sequential list of non-overlapping exposure timeframes. From there, it’s trivial to just add up the differences of the date for each time frame.
But how to get there?
Cursors
The existing code was working fine, but took upwards of 40+ minutes. Essentially, it worked via cursors and functions (with more cursors) to collect all the years of all the timeframes for each client, convert them to a list of singular year elements, then convert that to a recordset and finally count up the entries. Workable, but terribly slow.
Skinning the Cat
I’d done something similar ages ago for a medical billing system, so I knew this kind of manipulation could be fast. But I’d long since forgotten exactly how I’d done it.
However, a few Google searches and I landed on Peter Larsson’s blog post about collapsing date ranges using what he calls the “Clustered Index Update”. It’s 3 years old, but definitely something worth throwing in your bag of SQL tricks!
First, create some test data:
create table #test( id int, seq int, d1 datetime, d2 datetime) insert into #test select 1, null, '2005', '2006' union all select 1, null,'2007', '2009' union all select 2, null,'2001', '2006' union all select 2, null,'2003', '2008' UNION ALL SELECT 3, null,'2004', '2007' UNION ALL SELECT 3, null,'2005', '2006' UNION ALL SELECT 3, null,'2001', '2003' UNION ALL SELECT 3, null,'2002', '2005' UNION ALL SELECT 4, null,'2001', '2003' UNION ALL SELECT 4, null,'2005', '2009' UNION ALL SELECT 4, null,'2001', '2006' UNION ALL SELECT 4, null,'2003', '2008'
Next, make sure you have a clustered index across the ID and both Date fields:
CREATE CLUSTERED INDEX ix_id ON #test (ID, d1, d2) with fillfactor = 95
Be sure that the SEQ field is initialized to NULL or 0 (already done via the population code above).
Then, create several variables to assist with counting through the records to set the SEQ field. Use a SELECT to initialize those variables:
DECLARE @id INT, @Seq INT, @d1 DATETIME, @d2 DATETIME SELECT TOP 1 @Seq = 0, @id = id, @d1 = d1, @d2 = d2 FROM #test ORDER BY id, d1
The Trick
Finally, update the SEQ column using the “Clustered Index Update” trick:
UPDATE #test SET @Seq = CASE WHEN d1 > @d2 THEN @Seq + 1 WHEN id > @id THEN @Seq + 1 ELSE @Seq END, @d1 = CASE WHEN d2 > @d2 THEN d1 WHEN id > @id THEN d1 ELSE @d1 END, @d2 = CASE WHEN d2 > @d2 THEN d2 WHEN id > @id THEN d2 ELSE @d2 END, Seq = @Seq, @id = id
Essentially, what’s happening here is that since the update doesn’t specify an order, SQL will update via the physical order in the database, which is the same as the clustered index (a clustered index determines the physical ordering of records in the table). And since the records are ordered in ID, D1, D2 order, the SEQ column will be updated with an incrementing number that effectively clusters overlapping ranges together.
Since the records are already physically in that order, this update happens lightning fast because there’s no need to perform any index lookups.
You can see the end result by selecting all the records at this point:
select * from #test
Now, the data is ready, you just have to query it using that SEQ column. For instance, this SELECT will retrieve the start and end date of each non-overlapping cluster of dates belonging to each ID.
SELECT ID, MIN(d1) AS d1, MAX(d2) AS d2 FROM #test GROUP BY id, Seq ORDER BY Seq
Mr. Larsson also describes a query to retrieve the “gaps” (or missing date ranges), which could be handy for a different class of problem.
If, like me, you also need a grand total of the number of years, first, you can get the years in each collapsed timeframe and then get the grand total years, like this:
select ID, Years = Sum(Years) From ( SELECT ID, Years=Year(MAX(d2)) - Year(Min(d1)) + 1 FROM #test GROUP BY id, Seq ) a Group by id Order by id
Using this trick took this particular query (actually a sizable set of queries and cursor loops) from 40+ minutes to under a minute, with virtually all of that minute being spent filtering down the set of records that I needed to specifically collapse the timeframes on (in other words, doing stuff unrelated to actually collapsing the date ranges). In all, several million records being processed in a few seconds now.
Good stuff.