Compare tables to find missing rows
Let’s talk about the case where you want to compare tables to see what’s missing. We might be comparing a list of orders to an imported set of data in a staging table. Or, we’re identifying customers in the database that aren’t on the call list. Whatever the application, finding out the difference between one table and another is a very common need.
First, a bit of setup
We’ll set up a couple of temporary tables for our example, and load them up with a few random schema and table names. That’s a nice generic set of data. If we want to imagine a scenario for this data, this is something we might use in a custom maintenance routine (e.g., “which tables in the database weren’t reindexed last week”?).
CREATE TABLE #AllTableList ( SchemaName SYSNAME , TableName SYSNAME ); CREATE TABLE #MyTableList ( SchemaName SYSNAME , TableName SYSNAME ); INSERT INTO #AllTableList ( SchemaName , TableName ) SELECT TOP 10 s.name AS SchemaName , o.name AS TableName FROM sys.objects o JOIN sys.schemas s ON o.schema_id = s.schema_id WHERE o.type = 'u'; INSERT INTO #MyTableList ( SchemaName , TableName ) SELECT TOP 5 s.name AS SchemaName , o.name AS TableName FROM sys.objects o JOIN sys.schemas s ON o.schema_id = s.schema_id WHERE o.type = 'u';
There we go, one table with “all” the data, and one table with rows “missing”. So how do we tell the difference?
This query is wrong
If we reason it out, it makes sense that you’d have to join the tables together to see where the table and schema names within are NOT equal. Let’s see if that works:
SELECT DISTINCT al.SchemaName , al.TableName FROM #AllTableList AS al INNER JOIN #MyTableList AS my ON my.TableName <> al.TableName;
Run this, and you get back 45 rows. What went wrong?
What happened is exactly what we asked for: SQL joined every schema.tablename from “All” with every one from “My” that didn’t match. Go ahead and add the “My” columns to the select list, and you’ll see what I’m talking about:
SELECT DISTINCT al.SchemaName , al.TableName , my.SchemaName , my.TableName FROM #AllTableList AS al INNER JOIN #MyTableList AS my ON my.TableName <> al.TableName;
Now in our resultset, we see that dbo.dep1 is paired with dbo.dep2, and so on. We asked for mismatches, we got that. We didn’t get any info about missing data!
One right way: OUTER JOIN
We want to look at all the data from the “All” table, whether or not it matches, and then discard the matches, because we only want tables that don’t have a match. Let’s do this in two steps.
First, to get back all the data from “All”, we need a LEFT OUTER JOIN. And don’t forget we said “whether or not it matches”…if we’re talking about matches, then we need to get matches! That means we need to use a match criteria: equals (=) instead of not equals (<>). Oh, and we’ll want to add SchemaName back into the mix….it takes a table name and a schema to qualify as a match, in this case.
SELECT DISTINCT al.SchemaName , al.TableName , my.SchemaName , my.TableName FROM #AllTableList AS al LEFT OUTER JOIN #MyTableList AS my ON my.TableName = al.TableName AND my.SchemaName = al.SchemaName; -- not quite right yet
Now, that’s more like it! We can clearly see where the table names and schemas match, and where we’re missing data!
So all that’s missing is a WHERE my.TableName IS NULL clause, and we’re golden:
SELECT DISTINCT al.SchemaName , al.TableName , my.SchemaName , my.TableName FROM #AllTableList AS al LEFT OUTER JOIN #MyTableList AS my ON my.TableName = al.TableName AND my.SchemaName = al.SchemaName WHERE my.TableName IS NULL; -- Right! Shows missing rows from #MyTableList.
We’ve found the missing data! Of course, there’s more than one way to skin a query.
Another right way: WHERE NOT EXISTS
Very similar to the last example, we can compare the two result sets based on a matching criteria, and only return the rows that don’t exist in #MyTableList. The following query does exactly that, with a correlated subquery:
SELECT al.SchemaName , al.TableName FROM #AllTableList AS al WHERE NOT EXISTS ( SELECT * FROM #MyTableList AS my WHERE my.TableName = al.TableName AND my.SchemaName = al.SchemaName );
In pseudocode: Gimme All, where a match doesn’t exist (NOT EXISTS) in #MyTableList (as compared by TableName and SchemaName). Pretty cool, right?
Another right way: EXCEPT
The EXCEPT operator compares two result sets, column to column, and returns everything from the first set that isn’t in the second. So we can do exactly what we just did, like this:
SELECT SchemaName , TableName FROM #AllTableList EXCEPT SELECT SchemaName , TableName FROM #MyTableList;
Read in pseudocode, this says “give me schema and table from #AllTableList, unless you see that schema/table pair in #MyTableList”. Nice, eh?
These aren’t the only ways to get this information, of course. They’re just some of the most common.
For a little more on this topic – most especially why you might not want to use “NOT IN”, take a quick look at #4 on “The Ten Most Asked SQL Server Questions And Their Answers” on Less Than Dot.