Thinking All The Way Around a Query

As so often happens, I had an issue at work today that turned into (I think) a great blog/video idea.  Let's start with a simple scenario: We have two tables, each 2 columns wide, each with a 2 column composite key – meaning that (col1, col2) make up the unique primary key together.  We want to insert all rows from Table1 into Table2 except the duplicates.  Setup:

— Create tables and indexes
CREATE TABLE table1 (col1 INT, col2 INT)
GO
CREATE UNIQUE CLUSTERED INDEX PK_table1 ON table1 (col1, col2)
CREATE TABLE table2 (col1 INT, col2 INT)
GO
CREATE UNIQUE CLUSTERED INDEX PK_table1 ON table2 (col1, col2)

— Populate Table1
INSERT INTO table1 VALUES (1, 1)
INSERT INTO table1 VALUES (2, 3)
INSERT INTO table1 VALUES (4, 5)
INSERT INTO table1 VALUES (6, 7)
INSERT INTO table1 VALUES (8, 9)
INSERT INTO table1 VALUES (8, 10)

— Populate Table2
INSERT INTO table2 VALUES (1, 1)
INSERT INTO table2 VALUES (2, 3)
INSERT INTO table2 VALUES (8, 0) 

 There, lovely.  We already see that we have a duplicate row, (2,3) in each table. So if we try

INSERT INTO table2
SELECT col1, col2
FROM TABLE1 t1

Then we're going to violate the primary key.  We can get around that with a simple WHERE and subquery, BUT

INSERT INTO table2
SELECT col1, col2
FROM TABLE1 t1
WHERE t1.col1 NOT IN (SELECT col1 FROM table2)

…the problem is that we're only checking col1.  The above INSERT statement does not insert two rows we wanted, (8,9) and (8,10) because the WHERE clause filtered them out (based on Table2's row (8,0)).  So okay, we know we can't shotgun it and violate PK, and we can't do a NOT IN on col1.

One correct answer (there's always more than one, right) is a correlated subquery:

INSERT INTO table2
SELECT col1, col2
FROM TABLE1 t1
   WHERE NOT EXISTS (SELECT col1, col2
   FROM table2
   WHERE col1 = t1.col1 AND col2 = t1.col2)

This subquery selects rows from table2 that match (correlate to) rows in the parent query.  But then in the parent query WHERE clause, we specify NOT EXISTS.  To translate SQL to English: Insert all rows from table1 that DO NOT EXIST in table2.  Perfect!

Now this whole thing came about originally because I had to do this exact same thing, only from XML to a table, and the XML got me all befuddled.  I just don't work with XML much these days.  It was the process of thinking through – "hey, XML data is data, just like tables or flat files" – that got me breaking the problem down into table to table logic. So to celebrate, let's reverse-reverse engineer this back to XML! 

First, go back and truncate/repopulate Table2.  We're not rebuilding Table1, because the XML file is our data source now.  Next, let's just get this whole XML thing working:

declare @XmlDocument XML
— Here's our little XML document, just 2 rows for starters:
SET @XmlDocument=N'<ROOT><Table1 col1="16" col2="8"></Table1><Table1 col1="12" col2="5"></Table1></ROOT>'
 
DECLARE @XmlDocumentHandle int
 
EXEC sp_xml_preparedocument
  @XmlDocumentHandle OUTPUT ,
  @XmlDocument
 
— Execute a SELECT statement using OPENXML rowset provider.
INSERT INTO Table2
SELECT col1, col2
FROM OPENXML (@XmlDocumentHandle, '/ROOT/Table1', 2)
  WITH (col1 int '@col1',
    col2 int '@col2')

That'll successfully insert 2 rows into Table2. Okay, truncate/repopulate Table2 again, and this time we'll go for real:

declare @XmlDocument XML
SET @XmlDocument=N'<ROOT><Table1 col1="1" col2="1"></Table1><Table1 col1="2" col2="3"></Table1><Table1 col1="4" col2="5"></Table1><Table1 col1="6" col2="7"></Table1><Table1 col1="8" col2="9"></Table1><Table1 col1="8" col2="10"></Table1></ROOT>'
 
DECLARE @XmlDocumentHandle int
EXEC sp_xml_preparedocument
  @XmlDocumentHandle OUTPUT ,
  @XmlDocument
 
— Execute a SELECT statement using OPENXML rowset provider.
INSERT INTO Table2
SELECT col1, col2
FROM OPENXML (@XmlDocumentHandle, '/ROOT/Table1', 2)
WITH (col1 int '@col1',
  col2 int '@col2') t1
WHERE NOT EXISTS (SELECT col1, col2
    FROM table2
    WHERE col1 = t1.col1 AND col2 = t1.col2)

Now all we've done is marry the subquery logic that we worked out on our little tables, to the XML input (recognize that exact WHERE NOT EXISTS?).  Run this, and you fully populate your Table2 with no duplicates, and we have won the admiration of our peers and the gratitude of nations.

Here's the video, named Insert Non-Duplicate Rows.

-Jennifer McCown, http://www.midnightdba.com

Edited 4/16/10 to correct a typo.