Fun With SQL (OVER and PARTITION clauses)

Posted by admin on June 27, 2014 – 12:02 am

Filed under Hiring, SQL

I’ve been interviewing over the last few weeks for a new position (If anyone has a need for a very experienced .Net developer with some recent Ruby/Rails/Javascript/Coffeescript/JQuery etc chops, shoot me an email!)

At any rate, one company had an interesting SQL challenge.

In a nutshell, given a table with an ID field and a FLAG bit field, find all the consecutive rows where the FLAG field is 1 (ie a bad record).

The trick was that the output needed to look like:

ID RANGE        Error Count

4000-6000       2001

First, lets insert a CSV file into a table….

BULK INSERT Flags
FROM 'c:\dev\darin\sqltest\sqltest.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
)
GO

They supplied a 1.6 million row sample table, and expected sub 10 second response time. Sooooo, you also might want to add a clustered index to the ID field.

Let’s see some SQL!

With the sample table in place, I could get down to business.

I’ll admit, I had to google the syntax of the OVER and PARTITION clauses in T-SQL. They’re not something I use every day.

First, I needed all the rows, with a group number grouping all the consecutive rows who’s flag value was 1. I used a Common Table Expression for this. I could have used temp tables and it would have resulted in about the same performance, but… what the hey!

WITH parts AS (
   SELECT *, id - ROW_NUMBER() OVER (ORDER BY id) AS grp
   FROM flags
   WHERE flag = 1
)

Next, I needed the count of each group as part of the result set….

WITH counts AS (
   SELECT *, COUNT(*) OVER (PARTITION BY grp) AS cnt
   FROM parts
)

With those two CTEs in place, all that was left was the output Select….

SELECT cast(min(id) as varchar(20)) + '-' + cast(max(id) as varchar(20)) as IDRange, 
   cnt as ErrorCount
FROM counts
WHERE cnt >= 1000
GROUP BY grp, cnt
ORDER BY IDRange

In this case, I only cared about blocks of errors of more than 1000 rows. The concatenated SELECT is just to make a nice output format. realistically, I’d probably only select MIN(ID), MAX(ID), CNT

I haven’t used CTE’s a lot, but I’m finding more and more uses for them.

And the OVER clause is something I really need to get more familiar with.

For the 1.6 million row source table, I was getting sub 2 second response times on this query. No doubt, there’s ways to speed this up, but that’s not too shabby.

Fun With SQL (OVER and PARTITION clauses)

Let’s see some SQL!

Post a Comment

Subscribe

Recent Posts

Recent Comments

Archives

Categories

Meta