Work with predictive coding
Predictive coding allows you to apply a positive or negative code to a large population of documents using reviewers' marks on a relatively small set of training documents. The predictive coding process uses a predictive model, which is trained to mimic the marks of expert human reviewers.
What is a predictive model?
A predictive model is trained on experienced reviewers' marks on a set of training documents. The model maps the human reviewers' marks to the weighted characteristics of the training documents. You can then use the model to predict codes for the unmarked documents in a target population.
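To make the idea of "weighted characteristics" concrete, the following is a minimal sketch, not the product's actual algorithm. It assumes that the characteristics are simply the words in each document and that the model is a small logistic regression trained by gradient descent. The function names `train` and `predict_score` and the sample documents are hypothetical.

```python
import math
from collections import defaultdict


def predict_score(weights, bias, terms):
    """Return a score between 0 and 1; higher means more likely positive."""
    z = bias + sum(weights.get(term, 0.0) for term in terms)
    return 1.0 / (1.0 + math.exp(-z))


def train(training_docs, epochs=50, learning_rate=0.5):
    """Learn one weight per word from (text, is_positive) reviewer marks."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, is_positive in training_docs:
            terms = set(text.lower().split())
            # Error is the gap between the reviewer's mark and the current score.
            error = (1.0 if is_positive else 0.0) - predict_score(weights, bias, terms)
            bias += learning_rate * error
            for term in terms:
                weights[term] += learning_rate * error
    return weights, bias


# Hypothetical reviewer marks on a tiny training set.
training_docs = [
    ("merger pricing agreement", True),
    ("pricing discussion with competitor", True),
    ("lunch menu friday", False),
    ("office party friday", False),
]
weights, bias = train(training_docs)

# Score unmarked documents in a hypothetical target population.
for text in ["competitor pricing", "friday lunch"]:
    print(text, round(predict_score(weights, bias, set(text.split())), 3))
```

Words that appear mostly in positively marked documents end up with positive weights, so unmarked documents that contain them receive higher scores.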
How does predictive coding ensure the statistical validity of its predictions?
A trained predictive model can be used to predict codes for a population of documents that does not include the model's training set documents. To ensure the statistical validity of a model's predictions, predictive coding automatically removes a model's training set documents from any population or sample that you add to a model.
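The exclusion step happens automatically, but its logic can be illustrated with a short sketch. The function name `exclude_training_docs` and the document IDs are hypothetical and are not part of the product.

```python
def exclude_training_docs(population_ids, training_ids):
    """Return the population IDs that are not in the training set, in order."""
    training = set(training_ids)
    return [doc_id for doc_id in population_ids if doc_id not in training]


# Hypothetical document IDs.
population = ["D1", "D2", "D3", "D4"]
training_set = ["D2", "D4"]
print(exclude_training_docs(population, training_set))
```

Scoring a model on the same documents it was trained on would tend to overstate its quality, which is why the training set is kept out of any population or sample used for prediction or evaluation.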
How does predictive coding use expert human reviewers?
Experienced reviewers are crucial to predictive coding in the following ways:
To train a model, experts review and mark the model's training set.
To evaluate the quality of a model's predictions, experts review and mark a representative sample of each population to be predicted.
For information about how to perform predictive coding, see the following topics:
Preliminary steps for predictive coding
Perform predictive coding using Continuous Active Learning (CAL)
Perform predictive coding using the standard workflow
Compare standard predictive coding and Continuous Active Learning
Continuous Active Learning (CAL) differs from the standard predictive coding workflow in a few important ways, summarized in the following table. For more information about population-based predictive coding with CAL, see Perform predictive coding using Continuous Active Learning.
| | Standard workflow | CAL workflow |
|---|---|---|
| Use when... | The workflow allows producing without reviewing predicted positives (second request). It is important to be able to plan for how long the review will take. You value flexibility. Documents will be collected and processed on a rolling basis. | Very low prevalence expected. There is no distinction between training and review documents. You value workflow simplicity over flexibility. Up-front model training process is not feasible or desirable. You simply want to start reviewing. |
| Training | You have the flexibility to use random samples, judgmental samples, or active learning to select the most helpful documents for training a model. | The model simply trains on all reviewed documents in the population. The reviewed sample is included in the training set. |
| Predicting | You can apply a trained model to multiple target populations. The model can be used to prioritize documents for review. However, you can also set a threshold to code documents as positive and negative, based on a trade-off between recall and precision. | You do not set a threshold or apply codes. The model is used only to prioritize (score) documents within its population for review. |
| Validation | Through a series of discrete model versions, you can measure progress in terms of recall, precision, and yield. Requires a dedicated validation sample that excludes the training set and is representative of the target population. Produces defensible recall and precision estimates based on a representative sample, recorded in a final report. | Just one model version, updated continuously. Produces a defensible estimate of achieved recall and actual precision, updated continuously. |
| Knowing when you're done | Done when acceptable recall and precision are achieved, or when further training ceases to be economical. | Done when "recall to date" is sufficiently high. |
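The trade-off between recall and precision mentioned for the standard workflow can be illustrated with a small calculation. This is a sketch under stated assumptions: the validation sample is represented as hypothetical (score, reviewer mark) pairs, and a document is coded positive when its score is at or above the threshold. The function name `precision_recall` and the sample values are illustrative and do not come from the product.

```python
def precision_recall(sample, threshold):
    """Compute precision and recall for a reviewed validation sample.

    sample: list of (score, is_positive) pairs, where is_positive is the
    expert reviewer's mark and score is the model's prediction.
    """
    tp = sum(1 for score, pos in sample if score >= threshold and pos)
    fp = sum(1 for score, pos in sample if score >= threshold and not pos)
    fn = sum(1 for score, pos in sample if score < threshold and pos)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical validation sample: model score and the reviewer's mark.
validation_sample = [
    (0.95, True), (0.80, True), (0.70, False), (0.55, True),
    (0.40, False), (0.30, True), (0.10, False), (0.05, False),
]

for threshold in (0.75, 0.5):
    precision, recall = precision_recall(validation_sample, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this made-up sample, raising the threshold from 0.5 to 0.75 increases precision from 0.75 to 1.00 but lowers recall from 0.75 to 0.50, which is the kind of trade-off you weigh when choosing a threshold.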