Do I Need to Dedupe?

Let’s say I’ve got a set of 6,000 emails that might—or might not—have some duplicates in it.  Before I go crazy with deduping, I’d like a quick way to figure out whether this is an issue or not.  My strategy of choice is to use Advanced Filter to generate a list of the unique hash values.  If that list is considerably shorter than the original list of 6000, then some deduplicating is probably in order.

First, I select the column with the hash values in it and click on Advanced Filter on the Data Ribbon:

AF1

Now I fix the settings in the Advanced Filter dialog box:

  1. Select “Copy to another location”
  2. Select the appropriate List range
  3. Choose a destination for the unique values (a nearby empty column)
  4. Click the box for “Unique records only”
  5. Hit OK

 

af2b

The list of unique values will now appear in Column Q.

AF3

Well, well, well.  The list of unique values only goes to Row 4773.  That means we’ve got some 1,228 duplicates.

Of course, this doesn’t tell us which ones were duplicated, or how many duplicates of any given document there are.  It just tells us that there are plenty of dupes to be found.  In a later lesson we’ll use the COUNTIF function to identify which are the originals, and which are the dupes.


There’s a magic formula that will tell you how many unique values there are in a list or grid of values.  Two caveats though:

  1. It is very resource intensive, so if the list is large, it might bog down your computer for a while;
  2. If there any blank cells there you’ll have to fill them in with something (i.e., the word “blank”), otherwise you’ll get the dreaded #DIV/0! error.

For the list of Hash values above in cells O2 to O6001, the formula would be:

{=SUM(1/COUNTIF(O2:O6001,O2:O6001))}

Notice that this is an array formula; hit Ctrl+Shift+Enter, rather than Enter, and Excel will fill in the curly braces.

Here’s an example of the formula at work:

af5

The formula correctly determines that there are 5 unique values in cells C2 to C9, viz.,
apple, dog, tree, shell, spider.

I tip my finest hat—a lightweight but sturdy DelMonico Skip Panama with a 2” wired brim and a feathered simple black grosgrain band—to John Walkenbach, author of the Excel Bible series, for coming up with this beauty.

Advertisements

About excelesquire

NYC attorney and Excel enthusiast.
This entry was posted in eDiscovery and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s