Populate hashes
To ensure consistent coding for documents in a review, you can identify documents that are exact copies of other documents and group them into master and duplicates groups.
A document that is an exact copy of another document is called a duplicate document. The master document among a set of duplicate documents is generally the copy that is loaded into the application first. To be exactly the same, documents must have the same MD5 hash value and the same family MD5 hash value. A hash is a fixed-length value that an algorithm computes from content, so identical content always produces the same hash. The application uses hashes to identify identical documents and identical document families.
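As a rough illustration of why matching MD5 hash values indicate identical content, the following sketch hashes three byte strings with the Python standard library. Which content the application actually passes to MD5 is not described here, so hashing the raw bytes is an assumption, and the function name md5_hex is hypothetical.

```python
import hashlib

def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of the content as a 32-character hex string."""
    return hashlib.md5(content).hexdigest()

# Sample content invented for this illustration.
original = b"Quarterly report, final version"
exact_copy = b"Quarterly report, final version"
edited = b"Quarterly report, final version."

# Identical bytes give identical hashes.
print(md5_hex(original) == md5_hex(exact_copy))
# A one-character change gives a different hash.
print(md5_hex(original) == md5_hex(edited))
```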
Before reviewers can work with master and duplicate documents on the Documents page, you must run a populate hashes job. The populate hashes job writes each document hash value to a field in the database.
You can allow the coding values for a master document to apply to the associated duplicate documents. This reduces the number of documents that reviewers must review and code.
For information about how to work with master and duplicate documents on the Documents page, see Remove duplicate documents.
How Nuix Discover identifies master and duplicate documents
During the populate hashes job, the application identifies master and duplicate documents by applying the following hash values to documents:
Document hash value: String of characters generated by an algorithm and used to identify duplicate documents.
Family hash value: String of characters generated by an algorithm and used to identify duplicate document families. A document family is a group of related source and attachment documents, such as an email message and its attachments. For document families to be identical, the families must have identical source documents, and every attachment in one family must have an individual duplicate in each of the other families. The position of the attachments within a document family does not affect whether documents are considered family duplicates.
The family hash value for a document is populated in one of the following ways:
A user provides a family hash value during import or by coding the [RT] Family MD5 Hash field.
If a user does not provide a family hash value, and the Use Assigned Family Hash rules case option is not selected, the application generates a family hash based on the hash values of all of the documents in the family.
If a user does not provide a family hash value, and the Use Assigned Family Hash rules case option is selected, the application uses the hash of the top parent in the document family as the family hash for every document in the family.
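The three cases in the list above can be sketched as a small decision function. This is a minimal sketch, not the application's implementation: the function and parameter names are hypothetical, and the way a family hash is generated from the member hashes (MD5 of the source hash followed by the sorted attachment hashes) is an assumption chosen only so that attachment position does not matter.

```python
import hashlib
from typing import List, Optional

def generated_family_hash(top_parent_hash: str, attachment_hashes: List[str]) -> str:
    """Assumed algorithm: hash the source hash plus the sorted attachment hashes."""
    joined = top_parent_hash + "".join(sorted(attachment_hashes))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def resolve_family_hash(
    provided: Optional[str],
    use_assigned_family_hash_rules: bool,
    top_parent_hash: str,
    attachment_hashes: List[str],
) -> str:
    """Choose the family hash for every document in one family."""
    if provided:
        # A user supplied the value during import or by coding the field.
        return provided
    if use_assigned_family_hash_rules:
        # Option selected: the top parent's hash becomes the family hash.
        return top_parent_hash
    # Option not selected: derive the value from all documents in the family.
    return generated_family_hash(top_parent_hash, attachment_hashes)
```

In this sketch, calling resolve_family_hash with the same attachments in a different order returns the same generated value, which mirrors the rule that attachment position does not affect family duplicates.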
The application uses the hash values to compare documents at two levels:
Individual: The application compares each selected document with all other documents in the document set. If two documents have matching document hash values, the application identifies these documents as individual duplicates. The application designates the copy that was loaded into the application first as the master document.
Family: The application first compares each selected document with all other documents in the document set, and then compares the document families of the matching documents. For two documents to be family duplicates, both documents must have the same document hash value and the same family hash value. Matching on both values indicates that the documents are also from duplicate document families.
Family comparison is more rigorous than individual comparison and typically identifies fewer duplicates.
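The difference between the two comparison levels can be sketched by grouping documents on different keys. The Doc record, its field names, and the sample hash values are hypothetical, and the sketch assumes that the load order is available as a number. It illustrates the rules above and is not the application's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Doc:
    doc_id: str
    load_order: int   # lower means loaded into the application earlier
    doc_hash: str
    family_hash: str

def group_duplicates(docs: List[Doc], family_level: bool) -> Dict[str, List[str]]:
    """Map each master doc_id to the doc_ids of its duplicates."""
    groups: Dict[Tuple[str, ...], List[Doc]] = defaultdict(list)
    for doc in docs:
        # Individual level keys on the document hash only;
        # family level also requires a matching family hash.
        key = (doc.doc_hash, doc.family_hash) if family_level else (doc.doc_hash,)
        groups[key].append(doc)

    result: Dict[str, List[str]] = {}
    for members in groups.values():
        if len(members) < 2:
            continue  # a unique document forms no group in this sketch
        members.sort(key=lambda d: d.load_order)
        master, *duplicates = members  # earliest loaded copy is the master
        result[master.doc_id] = [d.doc_id for d in duplicates]
    return result

docs = [
    Doc("A1", 1, "h1", "famX"),
    Doc("B1", 2, "h1", "famX"),
    Doc("D1", 3, "h1", "famY"),
]
print(group_duplicates(docs, family_level=False))  # {'A1': ['B1', 'D1']}
print(group_duplicates(docs, family_level=True))   # {'A1': ['B1']}
```

The family-level call returns the smaller group, which matches the statement that family comparison typically identifies fewer duplicates.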
Master and duplicates examples
The following figures depict possible scenarios for master and duplicate documents and groupings. In the following figures, documents that have the same document hash values are indicated by the same color.
The documents in Document Families A and B have the same document hash values and the same family hash values, and are therefore in the same master and duplicates group.
The documents in Document Families A and C have the same document hash values, but the position of the attachments in the document families is different. However, because the position of the attachments does not affect whether documents are considered family duplicates, Document Families A and C also have the same family hash values, and are therefore in the same master and duplicates group.
In Document Families A and D, not all of the attachments are individual duplicates of each other. This means that the documents in Document Families A and D are not family duplicates, and are therefore not in the same master and duplicates group.
After the documents are grouped together, the application designates the copy that was loaded into the application first as the master document. All of the other documents in the master and duplicates group are designated as duplicates of the master document.
In the following figure, a dashed line surrounds a group of identical documents in Document Families A, B, and C. These identical documents are a group of master and duplicate documents. If the documents in Document Family A were loaded into the application first, then Attachment A1 is considered the master document. Attachments B1 and C1 are considered duplicates of Attachment A1. In addition, because these individual duplicate documents are in identical families, the documents are also considered family duplicates.
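The scenarios for Document Families A through D can also be expressed as a direct comparison of family contents. In this sketch the hash values are invented placeholders, and the Family type and are_family_duplicates function are hypothetical names. A multiset comparison stands in for the requirement that every attachment has an individual duplicate in the other family, regardless of position.

```python
from collections import Counter
from typing import List, NamedTuple

class Family(NamedTuple):
    source_hash: str
    attachment_hashes: List[str]

def are_family_duplicates(left: Family, right: Family) -> bool:
    """True if the sources match and the attachments match one-to-one in any order."""
    return (
        left.source_hash == right.source_hash
        and Counter(left.attachment_hashes) == Counter(right.attachment_hashes)
    )

family_a = Family("src", ["att1", "att2"])
family_b = Family("src", ["att1", "att2"])   # same documents, same positions
family_c = Family("src", ["att2", "att1"])   # same documents, different positions
family_d = Family("src", ["att1", "att9"])   # one attachment is not a duplicate

print(are_family_duplicates(family_a, family_b))  # True
print(are_family_duplicates(family_a, family_c))  # True
print(are_family_duplicates(family_a, family_d))  # False
```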
View populate hashes jobs
Administrators and group leaders with administrative access can view and add populate hashes jobs. For information about how to grant administrative access to group leaders, see Grant administrative access.
Note: You cannot delete populate hashes jobs.
To access the Hashes page:
On the Case Home page, under Manage Documents, click Hashes.
The following information appears on the Hashes page:
Status: Hover over the icon to view information about the status.
Job ID: The identification number of the job.
Populate: Indicates whether hashes are populated for all documents in the case or for unpopulated documents only.
Start: The date and time the job was submitted.
End: The date and time the job was completed.
Duration: How long the job took to complete.
Document errors: The number of documents with errors.
Owner: The name of the person who submitted the job.
Add a populate hashes job
Administrators and group leaders with administrative access can submit a populate hashes job. For information about how to grant group leaders access to the Hashes page, see Grant administrative access.
A populate hashes job also runs during an ingestion or an import.
To add a populate hashes job:
On the Case Home page, under Manage Documents, click Hashes.
Click Add.
Note: You cannot add a new populate hashes job while another job is running.
Select one of the following options:
All documents: Populates hashes and recalculates master and duplicate relationships for all documents.
Unpopulated only: Populates hashes for only unpopulated documents.
Click Save.
The new job appears in the list on the Hashes page. If the new job does not appear, click the Refresh button at the bottom of the page.
Optionally, to view the stages of the populate hashes job and the status of each stage, in the Job ID column, click the link for a job number. Hover over the icon to view information about the status.
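The difference between the All documents and Unpopulated only options can be pictured as a choice of which documents the job processes. This is a conceptual sketch only: the function name, the scope strings, and the representation of an unpopulated hash as None or an empty string are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

def select_documents(hash_by_doc: Dict[str, Optional[str]], scope: str) -> List[str]:
    """Return the doc_ids that a job with the given scope would process."""
    if scope == "all":
        # Every document is processed, including documents that already have a hash.
        return list(hash_by_doc)
    if scope == "unpopulated":
        # Only documents without a stored hash are processed.
        return [doc_id for doc_id, value in hash_by_doc.items() if not value]
    raise ValueError(f"unknown scope: {scope!r}")

# Placeholder data: DOC-2 and DOC-3 have no hash yet.
stored = {"DOC-1": "d41d8cd98f00b204e9800998ecf8427e", "DOC-2": None, "DOC-3": ""}
print(select_documents(stored, "all"))          # ['DOC-1', 'DOC-2', 'DOC-3']
print(select_documents(stored, "unpopulated"))  # ['DOC-2', 'DOC-3']
```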
View hash settings
Administrators can view the settings for the populate hashes job. The settings cannot be changed.
To view the hash settings:
On the Case Home page, under Manage Documents, click Hashes.
Click Hash settings.
View populate hashes errors
Administrators and group leaders with administrative access can view job and document errors. A job error can be caused by a document error. For example, a document error occurs if the master and duplicate documents have multiple hashes or invalid hashes.
To view populate hashes errors:
On the Case Home page, under Manage Documents, click Hashes.
Do any of the following:
To view job errors, click a link in the Job ID column. You can view the stages of a job and determine which stage of the job failed.
To view document errors, click a link in the Document errors column.
The documents with errors appear on the Documents page.
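If you want to check hash values for the two error conditions mentioned above before you load or code them, a check along the following lines may help. The exact rules that the application uses to decide that a hash is invalid are not documented here. This sketch assumes that a valid value is a 32-character hexadecimal MD5 string and that a document should carry only one distinct value. The function name is hypothetical.

```python
import re
from typing import List

# Assumption: a valid MD5 hash is exactly 32 hexadecimal characters.
MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")

def hash_problems(hash_values: List[str]) -> List[str]:
    """Return the problems found in the hash values recorded for one document."""
    problems: List[str] = []
    distinct = {value.lower() for value in hash_values}
    if len(distinct) > 1:
        problems.append("multiple hashes")
    if any(MD5_HEX.fullmatch(value) is None for value in hash_values):
        problems.append("invalid hash")
    return problems

print(hash_problems(["d41d8cd98f00b204e9800998ecf8427e"]))  # []
print(hash_problems(["not-a-hash"]))                        # ['invalid hash']
```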
Allow duplicate documents to inherit coding values from master documents
To reduce the number of documents that reviewers must review and code, you can allow the coding values for a master document to apply to the associated duplicate documents. This process is called autocoding. You can enable autocoding for fields and issues.
To allow duplicate documents to inherit coding values from their master document:
Create a field or issue. In the Name of the field or issue, add the suffix [AC], which stands for autocoding. For example, Confidentiality [AC].
For information about how to create fields, see Work with fields.
For information about how to create issues, see Work with issues.
Note: Memo fields do not inherit coding.
Add the field or issue to a coding template. For information about how to create coding templates, see Work with coding templates.
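The effect of the [AC] suffix can be pictured with the following sketch, which copies coding values from a master document to its duplicates. The data structures and the autocode function are hypothetical. The sketch overwrites any existing value on a duplicate, which is an assumption, because the behavior for previously coded duplicates is not described here.

```python
from typing import Dict, List, Set

AC_SUFFIX = "[AC]"

def autocode(
    master_coding: Dict[str, str],
    duplicates: List[Dict[str, str]],
    memo_fields: Set[str],
) -> None:
    """Copy [AC] field values from the master to each duplicate, in place."""
    for field, value in master_coding.items():
        if not field.endswith(AC_SUFFIX):
            continue  # only fields and issues with the suffix are inherited
        if field in memo_fields:
            continue  # memo fields do not inherit coding
        for duplicate in duplicates:
            duplicate[field] = value  # assumption: existing values are overwritten

master = {
    "Confidentiality [AC]": "Privileged",
    "Reviewer Notes [AC]": "Check with counsel",
    "Priority": "High",
}
duplicate_doc: Dict[str, str] = {}
autocode(master, [duplicate_doc], memo_fields={"Reviewer Notes [AC]"})
print(duplicate_doc)  # {'Confidentiality [AC]': 'Privileged'}
```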