
SQL for Deleting Records with No Related Records

Filed under SQL

Had another interesting SQL challenge come up today. Basically, I had two tables. I needed to delete all the records from Table A that did not have any related records in Table B.

Simple enough. The existing code was like so:

delete x
from #tmp x
left join #tmp as y on x.ID=y.ID and y.Status<>'dismissed'
where x.Status='dismissed'
    and y.ID is null

Now, in this case, Table A and Table B happen to be the same table, just interrelated, but I had similar situations with distinct tables.
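For instance, with genuinely distinct tables, the same anti-join pattern would look something like this (TableA and TableB are hypothetical names, just for illustration):

-- Sketch of the same anti-join delete across two distinct tables:
-- remove every TableA row with no matching TableB row
delete a
from TableA a
left join TableB b on a.ID = b.ID
where b.ID is null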

Since the table in question was a temporary table, there weren’t any indexes defined, so that would account for some of the slowdown, but this query was taking forever to finish.

Just looking at the query, though, it struck me that the LEFT JOIN was likely doing way more work than necessary, which a quick check of the execution plan confirmed.
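If you’d rather see hard numbers than read the graphical plan, SQL Server’s statistics output (a quick sketch) tells the same story:

set statistics io, time on

-- run the delete here, then compare logical reads and CPU time
-- for each statement in the Messages tab

set statistics io, time off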

The problem is that the join has to go ahead and connect all the records that actually are related, even though, in this case, those are the very records we don’t care anything about (notice the y.ID is null clause).

I’m guessing that the optimizer can’t or doesn’t realize that we don’t actually care about those rows in this case. Not surprising, but definitely interesting.

So, a quick rewrite is in order. I chose a correlated subquery:

delete x
from #tmp x
where x.Status='dismissed'
    and not exists(
        select top 1 1 from #tmp y where y.ID = x.ID and y.Status<>'dismissed'
    )

Believe it or not, this version executed in under a second against the exact same rowset of over a million rows. The difference was surprising, and I’m not often surprised anymore by SQL optimization tricks.

Ok, a few things to note:

  • The table still doesn’t have any indexes. Things will likely be even faster with some good indexes on the table. But this is a temp table, so you’ll have to weigh the cost of creating the index against the savings for any additional queries that might use it while the table exists (see the index sketch after this list).
  • Since I don’t actually care about any of the related records, I select the scalar value “1” for the result of the subquery. Technically, this is not actually necessary since the NOT EXISTS clause is used, which usually causes the SQL optimizer to automatically forgo returning any actual data. I just like to be explicit about things.
  • Further, since I only want to know whether any related records exist or not, I can select only the TOP 1 related record, and get it or nothing. This allows some additional SQL optimizations that center around use of the TOP clause, which is a good thing.
  • And finally, use of the NOT EXISTS clause allows even more internal SQL optimizations.
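If the temp table will stick around long enough to be reused, the index from the first bullet might look something like this (a sketch; the column choice simply mirrors the delete’s predicates):

-- Only worthwhile if the temp table is reused enough
-- to amortize the cost of building the index
create nonclustered index ix_tmp_id_status on #tmp (ID, Status)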

Bottom line

Definitely keep correlated subqueries in your SQL toolbelt. This one example shaved more than 20 minutes off the execution of a particular stored proc.

Are correlated subqueries always the answer? Oh, hell no. Often, they’re the worst approach.

Are there other ways to do this? Certainly. A CROSS JOIN is the first thing that comes to mind. Establishing indexes on the temp table would certainly help as well.
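For instance, here’s a NOT IN variant (just a sketch, and one that assumes ID is non-nullable, since NOT IN behaves badly when the subquery returns NULLs):

delete x
from #tmp x
where x.Status='dismissed'
    and x.ID not in (
        -- IDs that have at least one non-dismissed record
        select y.ID from #tmp y where y.Status<>'dismissed'
    )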

But as fast as this solution is, I didn’t take things any further.

Collapsing Date Ranges in T-SQL

Filed under Code Garage, SQL

I’ve been working on contract for a month or so now, helping to speed up some back-end database summarization activity that had gotten slow enough it was threatening to bleed into the next day’s time frame. Yuck!

Mostly standard stuff, tweaking indexes, ditching cursors, etc.

But one problem had me scratching my head today.

Essentially, I had a table of Clients, each one of which could be linked to any number of “Exposure” records, each of those having a start and stop date of exposure.

The trick was to determine how many total years of exposure a client had.

The thing is, each client might have multiple exposure records with overlapping (or not) time frames. So essentially, the problem boiled down to collapsing all the exposures into a single sequential list of non-overlapping exposure timeframes. From there, it’s trivial to add up the date differences for each time frame.

But how to get there?

Cursors

The existing code was working fine, but took upwards of 40 minutes. Essentially, it worked via cursors and functions (with more cursors) to collect all the years of all the timeframes for each client, convert them to a list of singular year elements, then convert that to a recordset, and finally count up the entries. Workable, but terribly slow.

Skinning the Cat

I’d done something similar ages ago for a medical billing system, so I knew this kind of manipulation could be fast. But I’d long since forgotten exactly how I’d done it.

However, a few Google searches later, I landed on Peter Larsson’s blog post about collapsing date ranges using what he calls the “Clustered Index Update”. It’s 3 years old, but definitely something worth throwing in your bag of SQL tricks!

First, create some test data:

create table #test(
   id int,
   seq int,
   d1 datetime,
   d2 datetime)

insert into #test
select 1, null, '2005', '2006' union all
select 1, null, '2007', '2009' union all
select 2, null, '2001', '2006' union all
select 2, null, '2003', '2008' union all
select 3, null, '2004', '2007' union all
select 3, null, '2005', '2006' union all
select 3, null, '2001', '2003' union all
select 3, null, '2002', '2005' union all
select 4, null, '2001', '2003' union all
select 4, null, '2005', '2009' union all
select 4, null, '2001', '2006' union all
select 4, null, '2003', '2008'

Next, make sure you have a clustered index across the ID and both Date fields:

CREATE CLUSTERED INDEX ix_id ON #test (ID, d1, d2) with fillfactor = 95

Be sure that the SEQ field is initialized to NULL or 0 (already done via the population code above).
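If your data arrived some other way, resetting that column is a one-liner:

-- Only needed if SEQ wasn't already populated as NULL or 0
update #test set seq = null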

Then, create several variables to assist with counting through the records to set the SEQ field. Use a SELECT to initialize those variables:

DECLARE    
    @id INT,
    @Seq INT,
    @d1 DATETIME,
    @d2 DATETIME
SELECT TOP 1
    @Seq = 0,
    @id = id,
    @d1 = d1,
    @d2 = d2
FROM #test
ORDER BY id, d1

The Trick

Finally, update the SEQ column using the “Clustered Index Update” trick:

UPDATE #test
SET
    -- Assignments run top to bottom for each row, so @Seq is computed
    -- against the previous row's running values before they're updated.
    -- Start a new group when this row begins after the running end date,
    -- or when we've moved on to a new ID.
    @Seq = CASE
        WHEN d1 > @d2 THEN @Seq + 1
        WHEN id > @id THEN @Seq + 1
        ELSE @Seq
        END,
    -- Refresh the running start date whenever the running end date
    -- advances or a new ID begins.
    @d1 = CASE
        WHEN d2 > @d2 THEN d1
        WHEN id > @id THEN d1
        ELSE @d1
        END,
    -- Carry the running maximum end date, resetting on a new ID.
    @d2 = CASE
        WHEN d2 > @d2 THEN d2
        WHEN id > @id THEN d2
        ELSE @d2
        END,
    -- Stamp the row with its group number.
    Seq = @Seq,
    -- Remember this row's ID for the next row.
    @id = id

Essentially, what’s happening here is that since the update doesn’t specify an order, SQL will update via the physical order in the database, which is the same as the clustered index (a clustered index determines the physical ordering of records in the table). And since the records are ordered in ID, D1, D2 order, the SEQ column will be updated with an incrementing number that effectively clusters overlapping ranges together.

Since the records are already physically in that order, this update happens lightning fast because there’s no need to perform any index lookups.

You can see the end result by selecting all the records at this point:

select * from #test

Now that the data is ready, you just have to query it using that SEQ column. For instance, this SELECT will retrieve the start and end date of each non-overlapping cluster of dates belonging to each ID.

SELECT      
    ID,
    MIN(d1) AS d1,
    MAX(d2) AS d2
FROM #test
GROUP BY id, Seq
ORDER BY Seq
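Against the test data above, that query should return one row per collapsed range, along these lines (each '2005'-style literal converts to January 1st of that year):

ID    d1            d2
1     2005-01-01    2006-01-01
1     2007-01-01    2009-01-01
2     2001-01-01    2008-01-01
3     2001-01-01    2007-01-01
4     2001-01-01    2009-01-01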

Mr. Larsson also describes a query to retrieve the “gaps” (or missing date ranges), which could be handy for a different class of problem.

If, like me, you also need a grand total of the number of years, you can first get the years in each collapsed timeframe and then sum them up per client, like this:

select
    ID,
    Years = sum(Years)
from (
    select
        ID,
        Years = year(max(d2)) - year(min(d1)) + 1
    from #test
    group by id, Seq
    ) a
group by id
order by id
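For the same test data, that should work out to:

ID    Years
1     5
2     8
3     7
4     9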

Using this trick took this particular query (actually a sizable set of queries and cursor loops) from 40+ minutes to under a minute, with virtually all of that minute spent filtering down the set of records whose timeframes actually needed collapsing (in other words, doing work unrelated to actually collapsing the date ranges). In all, several million records are now processed in a few seconds.

Good stuff.