Philosophy and Query to Find Duplicate Records

I was inspired today to pass on my method for finding dupliate records.  I'm sure we could put together a SP to generate the duplicate records query for a given table, but that's not how I roll today…today, it's about the journey.

First understand that a duplicate record is whatever you say it is.  If your table has a unique column – for examlpe, an IDENTITY column as the surrogate primary key – you can't include that in your query, or you'll never find duplicate rows.  And there are certain other things that may not count in terms of duplicates…for example, a LastModifiedDate column. Maybe you don't care if two rows have different modified by dates, and it's only the data that counts.  It's your database, your business rules…only you know for sure what counts as a dupe.

So given that, here's the basic rundown for getting duplicate rows:

SELECT count(*) as NumberOfRows,

  firstName ,

  lastName ,

  phoneNumber ,

  customerAccount

FROM Customer

GROUP BY firstName ,

  lastName ,
  phoneNumber ,

  customerAccount

HAVING COUNT(*) > 1

Notice that we included a count(*) in the select list, just to see how many duplicates are in the table (it's not actually required, just nice to have).  Otherwise, the select list is the same as the GROUP BY list.  HAVING COUNT(*) > 1 lets us return only the rows that have duplicates.  Finally, I often include a MAX(CustomerID) and/or MIN(CustomerID) in the SELECT list, to get the first and last instances of the dulpicated rows.  That can come in handy when you want to alter or delete all but the most recent duplicate in each group.  But that's a blog for another day.  What they hay, let's see the query with the MIN and MAX:

SELECT count(*) as NumberOfRows,

  MIN(CustomerID),

  MAX(CustomerID),

  firstName ,

  lastName ,

  phoneNumber ,

  customerAccount

FROM Customer

GROUP BY firstName ,

  lastName ,
  phoneNumber ,

  customerAccount

HAVING COUNT(*) > 1

-Jen McCown, http://www.MidnightDBA.com