Compare tables to find missing rows

Get it? Missing rows?
Get it?

Let’s talk about the case where you want to compare tables to see what’s missing. We might be comparing a list of orders to an imported set of data in a staging table. Or, we’re identifying customers in the database that aren’t on the call list. Whatever the application, finding out the difference between one table and another is a very common need.

First, a bit of setup

We’ll set up a couple of temporary tables for our example, and load them up with a few random schema and table names. That’s a nice generic set of data. If we want to imagine a scenario for this data, this is something we might use in a custom maintenance routine (e.g., “which tables in the database weren’t reindexed last week”?).

CREATE TABLE #AllTableList
    (
      SchemaName SYSNAME ,
      TableName SYSNAME
    );

CREATE TABLE #MyTableList
    (
      SchemaName SYSNAME ,
      TableName SYSNAME
    );

INSERT  INTO #AllTableList
        ( SchemaName ,
          TableName
        )
        SELECT TOP 10
                s.name AS SchemaName ,
                o.name AS TableName
        FROM    sys.objects o
                JOIN sys.schemas s ON o.schema_id = s.schema_id
        WHERE   o.type = 'u'; 

INSERT  INTO #MyTableList
        ( SchemaName ,
          TableName
        )
        SELECT TOP 5
                s.name AS SchemaName ,
                o.name AS TableName
        FROM    sys.objects o
                JOIN sys.schemas s ON o.schema_id = s.schema_id
        WHERE   o.type = 'u';

There we go, one table with “all” the data, and one table with rows “missing”. So how do we tell the difference?

This query is wrong

If we reason it out, it makes sense that you’d have to join the tables together to see where the table and schema names within are NOT equal. Let’s see if that works:

SELECT DISTINCT
        al.SchemaName ,
        al.TableName 
FROM    #AllTableList AS al
        INNER JOIN #MyTableList AS my ON my.TableName <> al.TableName;

Run this, and you get back 45 rows. What went wrong?

What happened is exactly what we asked for: SQL joined every schema.tablename from “All” with every one from “My” that didn’t match. Go ahead and add the “My” columns to the select list, and you’ll see what I’m talking about:

SELECT DISTINCT
        al.SchemaName ,
        al.TableName ,
        my.SchemaName ,
        my.TableName
FROM    #AllTableList AS al
        INNER JOIN #MyTableList AS my ON my.TableName <> al.TableName;

Now in our resultset, we see that dbo.dep1 is paired with dbo.dep2, and so on. We asked for mismatches, we got that. We didn’t get any info about missing data!

img2

One right way: OUTER JOIN

We want to look at all the data from the “All” table, whether or not it matches, and then discard the matches, because we only want tables that don’t have a match. Let’s do this in two steps.

First, to get back all the data from “All”, we need a LEFT OUTER JOIN.  And don’t forget we said “whether or not it matches”…if we’re talking about matches, then we need to get matches! That means we need to use a match criteria: equals (=) instead of not equals (<>).   Oh, and we’ll want to add SchemaName back into the mix….it takes a table name and a schema to qualify as a match, in this case.

SELECT DISTINCT
        al.SchemaName ,
        al.TableName ,
        my.SchemaName ,
        my.TableName
FROM    #AllTableList AS al
        LEFT OUTER JOIN #MyTableList AS my ON my.TableName = al.TableName
                                              AND my.SchemaName = al.SchemaName; -- not quite right yet

Now, that’s more like it! We can clearly see where the table names and schemas match, and where we’re missing data!

img3

So all that’s missing is a WHERE my.TableName IS NULL clause, and we’re golden:

SELECT DISTINCT
        al.SchemaName ,
        al.TableName ,
        my.SchemaName ,
        my.TableName
FROM    #AllTableList AS al
        LEFT OUTER JOIN #MyTableList AS my ON my.TableName = al.TableName
                                              AND my.SchemaName = al.SchemaName
        WHERE my.TableName IS NULL; -- Right! Shows missing rows from #MyTableList.

img4

We’ve found the missing data! Of course, there’s more than one way to skin a query.

Another right way: WHERE NOT EXISTS

Very similar to the last example, we can compare the two result sets based on a matching criteria, and only return the rows that don’t exist in #MyTableList. The following query does exactly that, with a correlated subquery:

SELECT  al.SchemaName ,
        al.TableName
FROM    #AllTableList AS al
WHERE   NOT EXISTS ( SELECT *
                     FROM   #MyTableList AS my
                     WHERE  my.TableName = al.TableName
                            AND my.SchemaName = al.SchemaName );

In pseudocode: Gimme All, where a match doesn’t exist (NOT EXISTS) in #MyTableList (as compared by TableName and SchemaName). Pretty cool, right?

Another right way: EXCEPT

The EXCEPT operator compares two result sets, column to column, and returns everything from the first set that isn’t in the second. So we can do exactly what we just did, like this:

SELECT  SchemaName ,
        TableName
FROM    #AllTableList
EXCEPT
SELECT  SchemaName ,
        TableName
FROM    #MyTableList;

Read in pseudocode, this says “give me schema and table from #AllTableList, unless you see that schema/table pair in #MyTableList”. Nice, eh?

These aren’t the only ways to get this information, of course. They’re just some of the most common.

Happy days,
Jen McCown
www.MidnightDBA.com/Jen

For a little more on this topic – most especially why you might not want to use “NOT IN”, take a quick look at #4 on “The Ten Most Asked SQL Server Questions And Their Answers” on Less Than Dot.

4 thoughts on “Compare tables to find missing rows

  1. Pingback: (SFTW) SQL Server Links 15/11/13 • John Sansom

  2. Pingback: My links of the week – November 17, 2013 | R4

  3. Shalini Hurley

    This is awesome, exactly what I needed and it wasn’t hard to figure out…Thank you.

Comments are closed.