Thinking All the Way Around a Query: Bonus Material

While I was filming the video for Thinking…Query, named Insert Non-Duplicate Rows, I got a question: instead of a correlated subquery, couldn't you just use a join with a negative condition?  So I included that discussion in the video.

To review: the problem we're trying to solve is inserting rows FROM Table1 INTO Table2, ignoring duplicates.  The trouble with using a JOIN is that it's looking for rows that satisfy the JOIN condition… that is, it wants the condition to resolve to true.  Trying to join on a negative condition goes against the nature of JOINs.  Let's explore that idea.

Here's our data:

Table1
1, 1
2, 3
4, 5

Table2
1, 1
2, 3
8, 0

And here's our attempt at inserting non-duplicate rows with a JOIN:

INSERT INTO table2
SELECT t1.col1, t1.col2
FROM TABLE1 t1
JOIN table2 t2 ON t2.col1 <> t1.col1
     AND t2.col2 <> t1.col2

So let's go through the logical step-by-step of how SQL Server will process this join.  Our join criterion is t2.col1 <> t1.col1 AND t2.col2 <> t1.col2.

1. Compare the first row from Table1 (col1 = 1, col2 = 1) against the first row in Table2 (col1 = 1, col2 = 1).  Plugging these values into the join criterion, it reads "1 <> 1 AND 1 <> 1".  This is clearly FALSE, so SQL Server moves on to the next comparison.
2. Compare the first row from Table1 (col1 = 1, col2 = 1) against the second row in Table2 (col1 = 2, col2 = 3).  This reads "2 <> 1 AND 3 <> 1".  Now, THIS is true, so the JOIN condition has been satisfied.  Therefore, the query attempts to insert the Table1 row (1,1).  But this violates our primary key constraint, and it's the opposite of what we want to do.
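
To see the damage for yourself, here's a quick sketch (using the sample tables above) that SELECTs instead of INSERTs, so you can inspect every pair the negative join condition matches:

-- Sketch: look at the row pairs the <> join actually produces
SELECT t1.col1, t1.col2, t2.col1 AS t2col1, t2.col2 AS t2col2
FROM TABLE1 t1
JOIN table2 t2 ON t2.col1 <> t1.col1
     AND t2.col2 <> t1.col2
-- Seven rows come back: (1,1) pairs with two Table2 rows, (2,3) with two,
-- and (4,5) with all three. The INSERT version would try to add every one
-- of those, duplicates and all, instead of skipping the true duplicates.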

That's really all there is to say about that. Why can't we use a join with not equals? Because it won't work: a JOIN returns every pair of rows that satisfies its condition, so a <> condition matches nearly every mismatched pair instead of filtering out the duplicates.

-Jen McCown, http://www.MidnightDBA.com 

Thinking All The Way Around a Query

As so often happens, I had an issue at work today that turned into (I think) a great blog/video idea.  Let's start with a simple scenario: We have two tables, each 2 columns wide, and each with a 2-column composite key, meaning that (col1, col2) together make up the unique primary key.  We want to insert all rows from Table1 into Table2 except the duplicates.  Setup:

-- Create tables and indexes
CREATE TABLE table1 (col1 INT, col2 INT)
GO
CREATE UNIQUE CLUSTERED INDEX PK_table1 ON table1 (col1, col2)
CREATE TABLE table2 (col1 INT, col2 INT)
GO
CREATE UNIQUE CLUSTERED INDEX PK_table2 ON table2 (col1, col2)

-- Populate Table1
INSERT INTO table1 VALUES (1, 1)
INSERT INTO table1 VALUES (2, 3)
INSERT INTO table1 VALUES (4, 5)
INSERT INTO table1 VALUES (6, 7)
INSERT INTO table1 VALUES (8, 9)
INSERT INTO table1 VALUES (8, 10)

-- Populate Table2
INSERT INTO table2 VALUES (1, 1)
INSERT INTO table2 VALUES (2, 3)
INSERT INTO table2 VALUES (8, 0) 

There, lovely.  We can already see that we have duplicate rows: (1,1) and (2,3) appear in both tables. So if we try

INSERT INTO table2
SELECT col1, col2
FROM TABLE1 t1

Then we're going to violate the primary key.  We can get around that with a simple WHERE and subquery, BUT

INSERT INTO table2
SELECT col1, col2
FROM TABLE1 t1
WHERE t1.col1 NOT IN (SELECT col1 FROM table2)

…the problem is that we're only checking col1.  The above INSERT statement skips two rows we wanted, (8,9) and (8,10), because the WHERE clause filtered them out (based on Table2's row (8,0)).  So okay, we know we can't shotgun it and violate the PK, and we can't do a NOT IN on col1 alone.
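
If you want to see that filtering in action, here's a quick sketch of which rows survive the single-column NOT IN test:

-- Sketch: which Table1 rows pass the col1-only filter?
SELECT t1.col1, t1.col2
FROM TABLE1 t1
WHERE t1.col1 NOT IN (SELECT col1 FROM table2)
-- Returns only (4,5) and (6,7); the two col1 = 8 rows are wrongly
-- excluded, because Table2 already contains (8,0).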

One correct answer (there's always more than one, right?) is a correlated subquery:

INSERT INTO table2
SELECT col1, col2
FROM TABLE1 t1
WHERE NOT EXISTS (SELECT col1, col2
   FROM table2
   WHERE col1 = t1.col1 AND col2 = t1.col2)

This subquery selects rows from table2 that match (correlate to) rows in the parent query.  But then in the parent query WHERE clause, we specify NOT EXISTS.  To translate SQL to English: Insert all rows from table1 that DO NOT EXIST in table2.  Perfect!
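
And since there's always more than one answer, here's an alternative sketch that isn't in the original post: on SQL Server 2005 and later, EXCEPT compares all the columns at once and handles the duplicate-removal for us.

-- Alternative sketch: EXCEPT returns the Table1 rows that don't appear in Table2
INSERT INTO table2
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table2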

Now this whole thing came about originally because I had to do this exact same thing, only from XML to a table, and the XML got me all befuddled.  I just don't work with XML much these days.  It was the process of thinking it through ("hey, XML data is data, just like tables or flat files") that got me breaking the problem down into table-to-table logic. So to celebrate, let's reverse-reverse engineer this back to XML!

First, go back and truncate/repopulate Table2.  We're not rebuilding Table1, because the XML file is our data source now.  The reset is just a quick sketch, reusing the original INSERTs:
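
-- Reset Table2 to its original three rows
TRUNCATE TABLE table2
INSERT INTO table2 VALUES (1, 1)
INSERT INTO table2 VALUES (2, 3)
INSERT INTO table2 VALUES (8, 0)

Next, let's just get this whole XML thing working: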

DECLARE @XmlDocument XML
-- Here's our little XML document, just 2 rows for starters:
SET @XmlDocument = N'<ROOT><Table1 col1="16" col2="8"></Table1><Table1 col1="12" col2="5"></Table1></ROOT>'

DECLARE @XmlDocumentHandle int

EXEC sp_xml_preparedocument
  @XmlDocumentHandle OUTPUT ,
  @XmlDocument

-- Execute a SELECT statement using the OPENXML rowset provider.
INSERT INTO Table2
SELECT col1, col2
FROM OPENXML (@XmlDocumentHandle, '/ROOT/Table1', 2)
  WITH (col1 int '@col1',
    col2 int '@col2')

-- Release the prepared document from memory when we're done with it
EXEC sp_xml_removedocument @XmlDocumentHandle

That'll successfully insert 2 rows into Table2. Okay, truncate/repopulate Table2 again, and this time we'll go for real:

DECLARE @XmlDocument XML
SET @XmlDocument = N'<ROOT><Table1 col1="1" col2="1"></Table1><Table1 col1="2" col2="3"></Table1><Table1 col1="4" col2="5"></Table1><Table1 col1="6" col2="7"></Table1><Table1 col1="8" col2="9"></Table1><Table1 col1="8" col2="10"></Table1></ROOT>'

DECLARE @XmlDocumentHandle int
EXEC sp_xml_preparedocument
  @XmlDocumentHandle OUTPUT ,
  @XmlDocument

-- Execute a SELECT statement using the OPENXML rowset provider.
INSERT INTO Table2
SELECT col1, col2
FROM OPENXML (@XmlDocumentHandle, '/ROOT/Table1', 2)
WITH (col1 int '@col1',
  col2 int '@col2') t1
WHERE NOT EXISTS (SELECT col1, col2
    FROM table2
    WHERE col1 = t1.col1 AND col2 = t1.col2)

-- Release the prepared document from memory
EXEC sp_xml_removedocument @XmlDocumentHandle

Now all we've done is marry the subquery logic we worked out on our little tables to the XML input (recognize that exact WHERE NOT EXISTS?).  Run this, and you fully populate Table2 with no duplicates, and we have won the admiration of our peers and the gratitude of nations.
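
If you'd like to double-check that claim, a quick sanity query:

-- Sanity check: Table2 should now hold seven rows, with no duplicates
SELECT col1, col2
FROM table2
ORDER BY col1, col2
-- Expected: (1,1), (2,3), (4,5), (6,7), (8,0), (8,9), (8,10)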

Here's the video, named Insert Non-Duplicate Rows.

-Jennifer McCown, http://www.midnightdba.com

Edited 4/16/10 to correct a typo.

Case Study: Rebuilding Indexes

It's not all that uncommon to start a new job, or inherit a project, and find out that index maintenance hasn't exactly been a priority.  One of the first things you want to do is look at fragmentation. 

This query will pull back all the indexes for My_Database, ordered by page count and then by average fragmentation in percent. (If you're not familiar with sys.dm_db_index_physical_stats, take a look at DMVs in Books Online.)

DECLARE @dbid SMALLINT
SELECT @dbid = DB_ID('My_Database')
SELECT OBJECT_NAME(object_id) AS [TableName] -- the DMV's object_id is the table, not the index
  , *
FROM sys.dm_db_index_physical_stats (@dbid, NULL, NULL, NULL, NULL)
ORDER BY page_count DESC, avg_fragmentation_in_percent DESC

You'll notice that I ordered by page count first, instead of fragmentation first.  I'm simply more interested in high fragmentation on big indexes than I am in indexes that only have four pages.  I may have 87% fragmentation on an 8-page index, but that means quite a lot less than 34% fragmentation on a 3,000-page index.

Note: I highly recommend saving this resultset somewhere for comparison with the post-reindex readings. Management loves before-and-after numbers.
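
One minimal way to do that, as a sketch (the baseline table name is my own invention):

-- Save a timestamped baseline of the fragmentation readings
SELECT GETDATE() AS capture_date
  , OBJECT_NAME(ps.object_id) AS table_name
  , ps.index_id
  , ps.page_count
  , ps.avg_fragmentation_in_percent
INTO dbo.IndexFragBaseline -- hypothetical name; SELECT INTO creates the table
FROM sys.dm_db_index_physical_stats (DB_ID('My_Database'), NULL, NULL, NULL, NULL) ps
-- For the post-reindex reading, INSERT INTO dbo.IndexFragBaseline instead of SELECT INTO.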

So from here, let's build our reindex script.  We have several options: we could drop and recreate indexes, recreate with DROP_EXISTING, or use REBUILD or REORGANIZE.  Our considerations include:

  • Today we're looking at a system that hasn't had index maintenance in months (if ever), so we want to completely rebuild the physical indexes. 
  • The particular DB I'm looking at isn't very big (the largest table is something like 300,000 rows, and the whole database is smaller than 15 GB), and it doesn't have excessive indexes (an average of two or three per table). 
  • And I have the downtime after hours to run the rebuild scripts without worrying about blocking (see the REORGANIZE sketch just after this list for the no-downtime alternative).
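
When that downtime isn't available, the REORGANIZE option mentioned above runs online.  A minimal sketch, with a hypothetical table name:

-- Sketch: online defragmentation, for when blocking is a concern
ALTER INDEX ALL ON [SomeBusyTable] REORGANIZE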

So, I've decided to rebuild all the indexes on all tables that have at least one highly fragmented index. In other words, if Table1 had 3 indexes, and one of those is 60% fragmented, I'll rebuild all the indexes for Table1.  Here's a bit of code that will accomplish that:

-- Script index rebuilds
USE My_Database
GO
DECLARE @dbid SMALLINT
SELECT @dbid = DB_ID('My_Database')
SELECT 'ALTER INDEX ALL ON [' + OBJECT_NAME(object_id) + '] REBUILD WITH (FILLFACTOR = 90)'
FROM sys.dm_db_index_physical_stats (@dbid, NULL, NULL, NULL, NULL)
WHERE avg_fragmentation_in_percent > 24
GROUP BY OBJECT_NAME(object_id) -- one statement per table, even if several of its indexes qualify
ORDER BY MAX(avg_fragmentation_in_percent) DESC

This will give us a list of ALTER INDEX statements, one for each table that has at least one index fragmented above 24%.  That's not a magical cutoff point; it's just a good starting place. Depending on your situation, you might choose to rebuild all your indexes, or not as many.

I also set the fill factor for each index to 90%.  Again, that's just a starting place.  Some of these tables aren't modified often, if at all, so I'll change those to 0 (which SQL Server treats the same as 100: fully packed pages).  Some of these tables are relatively large, modified often, and highly fragmented.  I might change some of those fill factors to 85%, or even 80%.
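
As a sketch of what those adjustments might look like (the table names are hypothetical):

-- A rarely-modified table: pack the pages full
ALTER INDEX ALL ON [StaticLookup] REBUILD WITH (FILLFACTOR = 100)
-- A large, busy, highly fragmented table: leave more free space per page
ALTER INDEX ALL ON [BusyOrders] REBUILD WITH (FILLFACTOR = 85)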

So now all you have to do is run your script in your test environment, and then schedule a time for prod. After your rebuilds run, remember to run that first query again and save the results for your before-and-after numbers.

Hey, we just solved a huge part of our performance problem in five minutes!  Another day, we'll tackle the next biggest parts: unnecessary cursors and 20-way joins.