Predictive coding is a machine learning technique used in e-discovery where an algorithm trains on a sample of human-reviewed documents, then automatically classifies the remaining population as relevant, non-relevant, or requiring further review. It's one of the most widely adopted forms of Technology-Assisted Review.
What is predictive coding?
Predictive coding is a supervised machine learning approach where human reviewers code a sample of documents, and an algorithm uses those decisions to predict how the rest should be classified. The term is often used interchangeably with Technology-Assisted Review (TAR). Strictly speaking, predictive coding refers to the algorithmic classification step within the broader TAR workflow.
Predictive coding gained judicial acceptance in Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012), a case that marked the first time a U.S. federal court endorsed the use of computer-assisted review for document production. The court in Rio Tinto PLC v. Vale S.A. (S.D.N.Y. 2015) further affirmed that predictive coding is an acceptable and even preferable method when dealing with large document volumes, noting that it can satisfy FRCP Rule 26(g) certification obligations when properly validated.
The process typically works in iterative rounds. A subject matter expert reviews an initial seed set. The algorithm trains on those decisions, then ranks the remaining population by predicted relevance. The expert reviews more samples, the model retrains, and the cycle repeats until performance metrics such as recall and precision stabilize at an acceptable level. This iterative process distinguishes predictive coding from simpler keyword searches, which can't adapt to context.
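That loop can be sketched in a few lines of Python. The tiny corpus, the word-count scorer, and every function name here are hypothetical simplifications for illustration, not any vendor's actual algorithm:

```python
# Toy sketch of the iterative predictive-coding loop (hypothetical scorer).
from collections import Counter

def train(coded):
    """Weight each word by how often it appears in relevant vs. non-relevant docs."""
    rel, non = Counter(), Counter()
    for words, label in coded:
        (rel if label else non).update(set(words))
    return {w: rel[w] - non[w] for w in set(rel) | set(non)}

def score(weights, words):
    return sum(weights.get(w, 0) for w in words)

def rank(weights, docs):
    """Order uncoded documents by predicted relevance, highest first."""
    return sorted(docs, key=lambda d: score(weights, d[1]), reverse=True)

corpus = {
    1: ["merger", "contract", "breach"],
    2: ["lunch", "menu"],
    3: ["contract", "damages"],
    4: ["birthday", "party"],
    5: ["breach", "damages", "merger"],
}

# Round 1: a subject matter expert codes a seed set.
coded = [(corpus[1], True), (corpus[2], False)]
weights = train(coded)

# Rank the remaining population; the expert reviews the top-ranked doc next.
remaining = [(i, w) for i, w in corpus.items() if i not in (1, 2)]
queue = rank(weights, remaining)

# Round 2: the expert codes that document, the model retrains, and the
# cycle repeats until performance stabilizes.
coded.append((dict(remaining)[queue[0][0]], True))
weights = train(coded)
```

The key property the sketch shows is that each round of human decisions reshapes the ranking of everything not yet reviewed, which is what keyword search cannot do.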
"By signing, an attorney or party certifies that to the best of the person's knowledge, information, and belief formed after a reasonable inquiry, a discovery request, response, or objection is consistent with these rules."-- Federal Rules of Civil Procedure, Rule 26(g)(1)(B), under which courts have recognized predictive coding as satisfying the "reasonable inquiry" standard
Key facts
- Grossman and Cormack's TREC Legal Track studies showed that predictive coding achieved recall and precision at least as high as exhaustive manual review across multiple data sets.
- A 2019 EDRM survey found that over 50% of e-discovery practitioners use some form of predictive coding in their review workflows.
- The distinction between TAR 1.0 and TAR 2.0 refers to whether the training process is a one-time step (1.0) or continuously updated as reviewers work (2.0).
- Courts have held that predictive coding results must be validated through statistical sampling to be defensible in litigation.
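The sampling-based validation courts expect reduces to comparing human grades against the model's calls on a random sample. The population mix and sample size below are invented numbers for illustration; a real workflow would also report confidence intervals around these point estimates:

```python
# Illustrative sample-based validation with made-up numbers.
import random

random.seed(7)

# Hypothetical population of (machine_predicted_relevant, truly_relevant) pairs.
population = [(True, True)] * 800 + [(True, False)] * 100 + \
             [(False, True)] * 50 + [(False, False)] * 9050

# Draw a random validation sample and grade it (grades simulated here).
sample = random.sample(population, 500)

tp = sum(1 for pred, truth in sample if pred and truth)
fp = sum(1 for pred, truth in sample if pred and not truth)
fn = sum(1 for pred, truth in sample if not pred and truth)

recall = tp / (tp + fn)       # share of relevant documents the model found
precision = tp / (tp + fp)    # share of the model's calls that were correct
```

Recall is usually the number courts scrutinize, since it measures how much responsive material the review might have missed.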
Predictive coding in Hintyr
Hintyr implements predictive coding through its TAR validation workflow. You train the model by grading document samples in the grading panel, and the system predicts relevance for the remaining set. The workflow supports both TAR 1.0 (Control Set), where a fixed sample validates the model, and TAR 2.0 (Elusion Testing), which tests whether responsive documents were missed in the discard pile.
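Elusion testing in particular comes down to a simple proportion: sample the discard pile, count responsive documents the graders find, and project that rate back to the whole pile. The figures below are invented for illustration and are not drawn from Hintyr:

```python
# Hypothetical elusion calculation (illustrative numbers only).
discard_pile_size = 40_000
sample_size = 400
responsive_found_in_sample = 2   # invented grading result

# Elusion rate: fraction of the sampled discard pile that was responsive.
elusion_rate = responsive_found_in_sample / sample_size

# Projected count of responsive documents left behind in the discard pile.
projected_missed = round(elusion_rate * discard_pile_size)
```

A low elusion rate supports the argument that the review was reasonably complete; whether a given rate is acceptable is a proportionality question for the parties and the court.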
You can create a validation test from the Case Menu, configure your statistical parameters, and let Hintyr draw a random sample for grading. For broader AI-powered review beyond statistical validation, Hintyr's AI agent can analyze documents, identify patterns, and assist with relevance determinations across your entire case.
Frequently asked questions
Is predictive coding the same as TAR?
How accurate is predictive coding compared to manual review?
Do I need a subject matter expert to use predictive coding?
How many documents need to be reviewed to train the model?
Related terms
- TAR (Technology-Assisted Review)
- Review Platform
- Agentic Review
- ESI (Electronically Stored Information)
- Proportionality