Today’s post is about a powerful Excel formula for tracking duplicates in a large set of email metadata. This routine is not for the faint of heart, but if you are managing a document production, supervising a redaction project, or creating a privilege log, this method will save you many hours of time and reduce the likelihood of mistakes. Roll up your sleeves and read on!
When the document ID numbers are in Column A, and the hash values (or other deduping criterion) are in Column B, type the word “original” in D2. The first formula, in cell D3, is (drum roll please):
=IF(COUNTIF($B$2:B3,B3)>1,”Duplicate of “&INDEX($A$2:A3,MATCH(B3,$B$2:B3,0)),”original”)
As explained in more detail below, the formula identifies duplicates based on the hash value. When it hits a duplicate value, it reports the Document ID number of the first document with the same hash value.
Here is a screen shot of the formula in action, in cell D7, where the first duplicate occurs:
Notice how the formula correctly identifies that the email in row 7, with Document ID ENRON-000004452, is a duplicate of the email in Row 6, with Document ID ENRON-000004451.
Here’s how it works:
The COUNTIF formula looks at all the hash values so far and counts up how many times the current value has occurred. If Excel is seeing the current value for the first time, the formula returns 1 – that means it is an “original” document. If it is seeing the hash value for, say, the 2nd time, it will return a 2. Column C in the screenshot contains just the COUNTIF function. If this is all you did, you would have a nice way of segregating the originals from the duplicates. All the 1’s are originals, and all the others are duplicates. The rest of the formula tells you what each duplicate is a duplicate of.
The MATCH formula finds the first occurrence of a value in a stack of values, and returns the relative row number where the value is found. Here, MATCH(B7,$B$2:B7,0) asks “In the stack of hash values from B2 (the first value) to B7 (the current value), where does the current hash value first appear?” The answer is 5: the 41DCD hash value first occurs in B6, which is the 5th row of the range. The 0 in the 3rd position of the MATCH function means that we want an exact match.
The INDEX formula pulls a value from a specified row of a specified stack of cells. For example, INDEX(A1:A100,8) would return the 8th value in the range A1:A100, i.e., the value in A8. The row number can be (and usually is) specified by another cell reference or a formula. For example, INDEX(A1:A100,C55) would return the value in A1:A100 corresponding to whatever value is in cell C55. Here, the MATCH function provides the reference. We want INDEX(A2:A7,n), where A2:A7 is the stack of Document ID numbers so far, and n is the row where the current hash value first appears. Hence, INDEX($A$2:A7,MATCH(B7,$B$2:B7,0).
Putting these three pieces together pulls the Document ID number of the first occurrence of any given hash value–that is the original document corresponding to any given duplicate! Of course, we only want to do this if the hash value in question is, in fact, a duplicate. That’s why this routine is embedded in an IF function that makes sure COUNTIF is greater than 1. Otherwise, the formula returns the message “original.”
A couple more notes about this routine:
- The references to A2 and B2 in the range arguments are anchored with $ signs because we don’t want those references to change when we copy the formula down. You can insert those $ signs by hitting the F4 key after typing or selecting the reference in the formula.
- As with most complicated formulas, you could break this formula down into several smaller pieces.
- The combination of INDEX and MATCH is a popular strategy for overcoming the limitations of the VLOOKUP function.
- The routine described here can also be used to link email attachments with their parent emails, based on the GroupID or FamilyID field.
Using this routine has helped me streamline privilege logs and document reviews, and avoid producing duplicate documents with inconsistent redactions. I hope you find it helpful, too.
This is quite a bit more complicated than what is covered in my off-the-shelf Excel Esquire courses. If you would like to learn about setting up a customized training that covers this type of eDiscovery analytics–whether for attorneys, paralegals, or litigation support staff–please get in touch.
Thanks very much for this! I was able to apply the formula to look for duplicate numbers in the same column, no need for conditional formatting, and I can apply further vlookups to the “original” text. 🙂
Thanks for the feedback, Margaret – I’m glad this worked for you!