Predictive coding is a machine learning technique used in e-discovery where an algorithm trains on a sample of human-reviewed documents, then automatically classifies the remaining population as relevant, non-relevant, or requiring further review. It's one of the most widely adopted forms of Technology-Assisted Review.
What is predictive coding?
Predictive coding is a supervised machine learning approach where human reviewers code a sample of documents, and an algorithm uses those decisions to predict how the rest should be classified. The term is often used interchangeably with Technology-Assisted Review (TAR). Strictly speaking, predictive coding refers to the algorithmic classification step within the broader TAR workflow.
Predictive coding gained judicial acceptance in Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012), a case that marked the first time a U.S. federal court endorsed the use of computer-assisted review for document production. The court in Rio Tinto PLC v. Vale S.A. (S.D.N.Y. 2015) further affirmed that predictive coding is an acceptable and even preferable method when dealing with large document volumes, noting that it can satisfy FRCP Rule 26(g) certification obligations when properly validated.
The process typically works in iterative rounds. A subject matter expert reviews an initial seed set. The algorithm trains on those decisions, then ranks the remaining population by predicted relevance. The expert reviews more samples, the model retrains, and the cycle repeats until performance metrics such as recall and precision stabilize at an acceptable level. This iterative process distinguishes predictive coding from simpler keyword searches, which can't adapt to context.
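That loop can be sketched in a few lines of Python. The tiny corpus, the word-count scorer, and every function name here are hypothetical simplifications for illustration, not any vendor's actual algorithm:

```python
# Toy sketch of the iterative predictive-coding loop (hypothetical scorer).
from collections import Counter

def train(coded):
    """Weight each word by how often it appears in relevant vs. non-relevant docs."""
    rel, non = Counter(), Counter()
    for words, label in coded:
        (rel if label else non).update(set(words))
    return {w: rel[w] - non[w] for w in set(rel) | set(non)}

def score(weights, words):
    return sum(weights.get(w, 0) for w in words)

def rank(weights, docs):
    """Order uncoded documents by predicted relevance, highest first."""
    return sorted(docs, key=lambda d: score(weights, d[1]), reverse=True)

corpus = {
    1: ["merger", "contract", "breach"],
    2: ["lunch", "menu"],
    3: ["contract", "damages"],
    4: ["birthday", "party"],
    5: ["breach", "damages", "merger"],
}

# Round 1: a subject matter expert codes a seed set.
coded = [(corpus[1], True), (corpus[2], False)]
weights = train(coded)

# Rank the remaining population; the expert reviews the top-ranked doc next.
remaining = [(i, w) for i, w in corpus.items() if i not in (1, 2)]
queue = rank(weights, remaining)

# Round 2: the expert codes that document, the model retrains, and the
# cycle repeats until performance stabilizes.
coded.append((dict(remaining)[queue[0][0]], True))
weights = train(coded)
```

The key property the sketch shows is that each round of human decisions reshapes the ranking of everything not yet reviewed, which is what keyword search cannot do.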
"By signing, an attorney or party certifies that to the best of the person's knowledge, information, and belief formed after a reasonable inquiry, a discovery request, response, or objection is consistent with these rules."-- Federal Rules of Civil Procedure, Rule 26(g)(1)(B), under which courts have recognized predictive coding as satisfying the "reasonable inquiry" standard
Key facts
- Grossman and Cormack's TREC Legal Track studies showed that predictive coding achieved recall and precision at least as high as exhaustive manual review across multiple data sets.
- A 2019 EDRM survey found that over 50% of e-discovery practitioners use some form of predictive coding in their review workflows.
- The distinction between TAR 1.0 and TAR 2.0 refers to whether the training process is a one-time step (1.0) or continuously updated as reviewers work (2.0).
- Courts have held that predictive coding results must be validated through statistical sampling to be defensible in litigation.
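The sampling-based validation courts expect reduces to comparing human grades against the model's calls on a random sample. The population mix and sample size below are invented numbers for illustration; a real workflow would also report confidence intervals around these point estimates:

```python
# Illustrative sample-based validation with made-up numbers.
import random

random.seed(7)

# Hypothetical population of (machine_predicted_relevant, truly_relevant) pairs.
population = [(True, True)] * 800 + [(True, False)] * 100 + \
             [(False, True)] * 50 + [(False, False)] * 9050

# Draw a random validation sample and grade it (grades simulated here).
sample = random.sample(population, 500)

tp = sum(1 for pred, truth in sample if pred and truth)
fp = sum(1 for pred, truth in sample if pred and not truth)
fn = sum(1 for pred, truth in sample if not pred and truth)

recall = tp / (tp + fn)       # share of relevant documents the model found
precision = tp / (tp + fp)    # share of the model's calls that were correct
```

Recall is usually the number courts scrutinize, since it measures how much responsive material the review might have missed.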
Predictive coding in Hintyr
Hintyr implements predictive coding through its TAR validation workflow. You train the model by grading document samples in the grading panel, and the system predicts relevance for the remaining set. The workflow supports both TAR 1.0 (Control Set), where a fixed sample validates the model, and TAR 2.0 (Elusion Testing), which tests whether responsive documents were missed in the discard pile.
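Elusion testing in particular comes down to a simple proportion: sample the discard pile, count responsive documents the graders find, and project that rate back to the whole pile. The figures below are invented for illustration and are not drawn from Hintyr:

```python
# Hypothetical elusion calculation (illustrative numbers only).
discard_pile_size = 40_000
sample_size = 400
responsive_found_in_sample = 2   # invented grading result

# Elusion rate: fraction of the sampled discard pile that was responsive.
elusion_rate = responsive_found_in_sample / sample_size

# Projected count of responsive documents left behind in the discard pile.
projected_missed = round(elusion_rate * discard_pile_size)
```

A low elusion rate supports the argument that the review was reasonably complete; whether a given rate is acceptable is a proportionality question for the parties and the court.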
You can create a validation test from the Case Menu, configure your statistical parameters, and let Hintyr draw a random sample for grading. For broader AI-powered review beyond statistical validation, Hintyr's AI agent can analyze documents, identify patterns, and assist with relevance determinations across your entire case.
Frequently asked questions
Is predictive coding the same as TAR?
How accurate is predictive coding compared to manual review?
Do I need a subject matter expert to use predictive coding?
How many documents need to be reviewed to train the model?
Related terms
- TAR (Technology-Assisted Review)
- Review Platform
- Agentic Review
- ESI (Electronically Stored Information)
- Proportionality