I’ve been interviewing over the last few weeks for a new position (if anyone has a need for a very experienced .NET developer with some recent Ruby/Rails/JavaScript/CoffeeScript/jQuery etc. chops, shoot me an email!)
At any rate, one company had an interesting SQL challenge.
In a nutshell: given a table with an ID field and a FLAG bit field, find all the runs of consecutive rows where the FLAG field is 1 (i.e., bad records).
The trick was that the output needed to look like:
ID Range      Error Count
4000-6000     2001
First, let’s insert a CSV file into a table…
BULK INSERT Flags
FROM 'c:\dev\darin\sqltest\sqltest.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2
)
GO
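Of course, the table has to exist before you can bulk insert into it. The exact schema wasn’t part of the challenge, but something along these lines would do (the column names and types here are just my guess):

CREATE TABLE Flags (
    id   INT NOT NULL,   -- row identifier from the CSV
    flag BIT NOT NULL    -- 1 = bad record
)
GO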
They supplied a 1.6 million row sample table and expected sub-10-second response times. Sooooo, you might also want to add a clustered index on the ID field.
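That’s a one-liner (the index name is just what I’d call it):

CREATE CLUSTERED INDEX IX_Flags_Id ON Flags (id)
GO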
Let’s see some SQL!
With the sample table in place, I could get down to business.
I’ll admit, I had to google the syntax of the OVER and PARTITION clauses in T-SQL. They’re not something I use every day.
First, I needed all the rows, with a group number identifying each run of consecutive rows whose flag value is 1. The trick: within a run of consecutive IDs, both the ID and ROW_NUMBER() increase by one, so their difference stays constant for the whole run and changes whenever there’s a gap. I used a Common Table Expression for this. I could have used temp tables and it would have resulted in about the same performance, but… what the hey!
WITH parts AS (
    SELECT *, id - ROW_NUMBER() OVER (ORDER BY id) AS grp
    FROM flags
    WHERE flag = 1
)
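To make that concrete, here’s what the grp column works out to for a few made-up IDs (not the real data, just an illustration):

-- id     ROW_NUMBER()   grp = id - ROW_NUMBER()
-- 4000   1              3999
-- 4001   2              3999
-- 4002   3              3999
-- 9000   4              8996   <- gap in the IDs, so a new group starts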
Next, I needed the count of each group as part of the result set, so I chained a second CTE onto the first (note the leading comma; it’s part of the same WITH statement)…
, counts AS (
    SELECT *, COUNT(*) OVER (PARTITION BY grp) AS cnt
    FROM parts
)
With those two CTEs in place, all that was left was the output SELECT…
SELECT CAST(MIN(id) AS varchar(20)) + '-' + CAST(MAX(id) AS varchar(20)) AS IDRange,
       cnt AS ErrorCount
FROM counts
WHERE cnt >= 1000
GROUP BY grp, cnt
ORDER BY IDRange
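Stitched together, the whole thing is one statement (the same pieces as above, just in one place):

WITH parts AS (
    SELECT *, id - ROW_NUMBER() OVER (ORDER BY id) AS grp
    FROM flags
    WHERE flag = 1
),
counts AS (
    SELECT *, COUNT(*) OVER (PARTITION BY grp) AS cnt
    FROM parts
)
SELECT CAST(MIN(id) AS varchar(20)) + '-' + CAST(MAX(id) AS varchar(20)) AS IDRange,
       cnt AS ErrorCount
FROM counts
WHERE cnt >= 1000
GROUP BY grp, cnt
ORDER BY IDRange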
In this case, I only cared about blocks of 1000 or more error rows. The concatenated SELECT is just to make a nice output format. Realistically, I’d probably only select MIN(ID), MAX(ID), and CNT.
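That plainer version would just be the final SELECT with the string formatting dropped (the column aliases are just my naming):

SELECT MIN(id) AS StartId, MAX(id) AS EndId, cnt AS ErrorCount
FROM counts
WHERE cnt >= 1000
GROUP BY grp, cnt
ORDER BY MIN(id)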
I haven’t used CTEs a lot, but I’m finding more and more uses for them.
And the OVER clause is something I really need to get more familiar with.
For the 1.6 million row source table, I was getting sub-2-second response times on this query. No doubt there are ways to speed this up, but that’s not too shabby.