
What Is Redaction? Defensible Document Redaction in E-Discovery

A black rectangle on a PDF isn’t a redaction. It’s a sticker. A real redaction physically deletes the bytes underneath, and federal rules put that deletion squarely on counsel’s desk.

Alexander Cohan, Ph.D.

Founder & CEO, Hintyr

Legal technology researcher and data scientist specializing in AI governance for litigation teams. Expertise in NLP and AI-assisted document review.

May 9, 2026

Permanent PDF redaction over a contract page, illustrating defensible document redaction in e-discovery.

"First Principles"

What Redaction Actually Means

Most lawyers learn the word the way they learned “discovery”: through context, mostly correctly, and never with a definition that survives a hard question. Pin it down. Redaction is the permanent removal of content from a document. Not hiding. Not covering. Removal. Adobe’s own documentation puts it bluntly: once you confirm, “the redactions and hidden information are permanently removed and saved to a new file.” The key word is permanently. If the underlying text or pixels can be recovered, you didn’t redact. You annotated.

That’s the conceptual gap behind every famous redaction failure you’ve read about. PyMuPDF, the open-source PDF library that quietly powers a lot of legal tooling, frames the engineering reality plainly: a black rectangle drawn over a name in a complaint isn’t enough, because the underlying content stream “could still be extracted by a program.” Once a true redaction is saved, by contrast, the redacted area “removes text and graphics completely from that area.” The black bar is a confirmation of work already done, not the work itself.

Why does the small-firm bar care about this? The cost of getting it wrong is asymmetric. Scrub a privileged communication correctly and nobody notices. Miss one byte of unredacted client data and the failure shows up in a Bloomberg headline, an opposing motion, or your malpractice carrier’s inbox. You can’t unship a production. Once a defective PDF is on a court’s docket, the clawback fight is already happening on the back foot.

Redaction sits inside the larger machinery of the e-discovery workflow, but it has its own ethical contours. Collection and processing are mostly engineering problems. Redaction is a judgment call, applied at the document level, that touches privilege, statutory privacy, and protective-order obligations at once. Treat it like the legal task it actually is, not the formatting step it pretends to be.

"Federal Privacy Rule"

FRCP 5.2 and the Counsel-Owned Duty

Federal Rule of Civil Procedure 5.2 is the closest thing federal litigation has to a baseline redaction statute. When a filing contains an individual’s social-security number, taxpayer identification number, birth date, the name of a known minor, or a financial-account number, Rule 5.2(a) lets you include only the last four digits of the SSN or taxpayer ID, the year of birth, the minor’s initials, and the last four digits of the financial account. Everything else gets cut.
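The truncation rules in Rule 5.2(a) are mechanical enough to sketch in a few lines. What follows is a minimal illustration, not a filing tool: the function names (last_four, birth_year, minor_initials) are hypothetical, and the sketch assumes the identifiers have already been located by a reviewer.

```python
import re
from datetime import date


def last_four(identifier: str) -> str:
    """Keep only the last four digits of an SSN, taxpayer ID, or account number."""
    digits = re.sub(r"\D", "", identifier)  # drop dashes, spaces, and other separators
    if len(digits) < 4:
        raise ValueError("identifier has fewer than four digits")
    return digits[-4:]


def birth_year(birth_date: date) -> str:
    """Keep only the year of birth."""
    return str(birth_date.year)


def minor_initials(full_name: str) -> str:
    """Reduce a minor's name to initials, one per name part."""
    return "".join(part[0].upper() + "." for part in full_name.split())


print(last_four("123-45-6789"))          # 6789
print(birth_year(date(1990, 5, 17)))     # 1990
print(minor_initials("jane doe"))        # J.D.
```

The sketch only covers the output side of the rule. Finding every identifier in a filing is the harder problem, and it remains the filer's responsibility.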

There are carve-outs. Rule 5.2(b) exempts forfeiture filings, administrative records, state-court records, sealed filings, and pro se habeas filings. Rule 5.2(d) lets the court order a filing under seal without redaction, then later unseal or require a public redacted version. Rule 5.2(h) is the trap. A person waives the privacy protection for their own information by filing it without redaction and not under seal. That’s a one-way door.

The bankruptcy, criminal, and appellate counterparts run parallel. Federal Rule of Criminal Procedure 49.1 mirrors 5.2 and adds the home address. Federal Rule of Bankruptcy Procedure 9037 tracks 5.2 almost word for word. Federal Rule of Appellate Procedure 25(a)(5) carries the same obligations into appeal.

Here’s the part that surprises lawyers who haven’t read the advisory note: the clerk does not check your work. The Advisory Committee Note to the 2007 adoption says so directly. A federal clerk’s office isn’t a privacy-compliance department. It’s a filing intake. The filer holds the duty, and if the filer is represented, that duty runs through counsel of record. Courts have sanctioned lawyers for filing unredacted SSNs because 5.2 would be toothless if nobody enforced it.

“The clerk is not required to review documents filed with the court for compliance with this rule. The responsibility to redact filings rests with counsel and the party or nonparty making the filing.”

– Fed. R. Civ. P. 5.2 advisory committee's note to 2007 adoption

"Beyond the Federal Rule"

HIPAA Safe Harbor, State Variants, and Privilege

Rule 5.2 is a floor, not a ceiling. The moment a matter touches protected health information, you’re in HIPAA territory, and the Safe Harbor standard at 45 C.F.R. § 164.514(b)(2) takes over. Safe Harbor lists eighteen categories of identifiers that have to come out before data is treated as de-identified:

names
geographic subdivisions smaller than a state
every element of a date more specific than year
telephone numbers
fax numbers
email addresses
social-security numbers
medical-record numbers
health-plan beneficiary numbers
account numbers
certificate and license numbers
vehicle identifiers and serial numbers
device identifiers and serial numbers
URLs
IP addresses
biometric identifiers, including finger and voice prints
full-face photographs
any other unique identifying number, characteristic, or code

There’s a narrow ZIP-prefix carve-out, and ages over 89 collapse into “90 or older.” Leaving any one of the eighteen categories in place means the data isn’t de-identified under Safe Harbor.
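The all-or-nothing character of Safe Harbor can be sketched as a checklist. This is a simplified illustration under stated assumptions: the category labels are paraphrased from the list above, the function names are hypothetical, and the sketch does not model the ZIP-prefix carve-out or any other condition in the regulation.

```python
from datetime import date

# The eighteen categories, paraphrased from the list above.
SAFE_HARBOR_CATEGORIES = (
    "names",
    "geographic subdivisions smaller than a state",
    "date elements more specific than year",
    "telephone numbers",
    "fax numbers",
    "email addresses",
    "social-security numbers",
    "medical-record numbers",
    "health-plan beneficiary numbers",
    "account numbers",
    "certificate and license numbers",
    "vehicle identifiers and serial numbers",
    "device identifiers and serial numbers",
    "URLs",
    "IP addresses",
    "biometric identifiers",
    "full-face photographs",
    "any other unique identifying number, characteristic, or code",
)


def outstanding_categories(cleared):
    """Return the categories a reviewer has not yet confirmed as removed."""
    unknown = set(cleared) - set(SAFE_HARBOR_CATEGORIES)
    if unknown:
        raise ValueError(f"unrecognized categories: {sorted(unknown)}")
    return [c for c in SAFE_HARBOR_CATEGORIES if c not in cleared]


def is_deidentified(cleared) -> bool:
    """True only when every one of the eighteen categories has been cleared."""
    return not outstanding_categories(cleared)


def generalize_age(age: int) -> str:
    """Ages over 89 collapse into a single '90 or older' bucket."""
    if age < 0:
        raise ValueError("age cannot be negative")
    return "90 or older" if age > 89 else str(age)


def generalize_date(d: date) -> str:
    """Drop every date element more specific than the year."""
    return str(d.year)
```

A reviewer who has cleared seventeen categories gets False from is_deidentified, which is the point: there is no partial credit.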

State rules layer on top. Texas Rule 21c defines sensitive data as government IDs, financial accounts, birth dates, home addresses, and the names of minors at filing. The rule even dictates format: an “X” in place of each omitted character, or visible removal indicating redaction. Illinois Supreme Court Rule 138 covers a similar set of identifiers and tells courts to award fees if the violation was willful. New Jersey, California, Florida, New York, Massachusetts, Pennsylvania, and Washington each carry their own variants. None of them substitute for 5.2. They stack on it.

Then there’s privilege. The attorney-client privilege the Supreme Court reaffirmed in Upjohn Co. v. United States, 449 U.S. 383, 389 (1981), and the work-product doctrine from Hickman v. Taylor, 329 U.S. 495, 510-11 (1947), now codified at Federal Rule of Civil Procedure 26(b)(3), are the most common reasons you actually redact in discovery productions, as opposed to court filings. The standard is fact-specific, document by document. You’re not removing a number that matches a regex. You’re protecting a statement of legal advice or a memo prepared in anticipation of litigation, embedded inside an email thread that’s otherwise responsive.

That subjectivity bleeds into the duty to preserve and the spoliation risks that follow, because producing a document with a privilege redaction is itself a representation about what’s underneath the black bar. If the redaction is wrong, the production is wrong.

"Recurring Failures"

Why Black Boxes Aren’t Redactions

The reason redaction failures repeat decade after decade is mechanical, not legal. The NSA’s 2005 guidance boiled it down to two errors: either you draw an image layer over text without removing the underlying text, or you set the background color to match the text color. Either way, the material is still in the document, sitting underneath the visible appearance, recoverable by search, copy-paste, or any free PDF library. About half a dozen variants live in the wild: white-on-white font tricks, image annotations laid over searchable text, metadata leakage from comments and revision history, content-stream text that doesn’t render but stays in the file, defeated OCR, and flatten-versus-true-delete confusion. Same mistake, different costumes.

The cautionary tales most litigators know land on the same point. Paul Manafort’s defense team submitted a court filing in January 2019 with black bars laid over text. A reporter recovered the language with a copy-paste, exposing the polling-data sharing the redactions were meant to hide.

Quinn Emanuel’s “junior associate missing one redaction” in Apple Inc. v. Samsung Elecs. Co., 2014 WL 12596470 (N.D. Cal. Jan. 29, 2014), turned into a sanctions order after the same defective document was uploaded to an FTP site and emailed to more than ninety Samsung employees. The Maxwell-Epstein DOJ release in May 2026 produced what victim counsel called thousands of redaction failures in 48 hours, when the agency had been ordered to do one thing: redact known victim identifiers.

What every one of these shares is a workflow that looked like redaction without producing redaction. The doctrine small firms can carry out of this is the one the NSA wrote down twenty years ago: visual hiding isn’t removal, and any document that lets you recover the original by selecting and copying hasn’t been redacted. For the vendor-specific QC playbook, our deep dive on AI redaction failures and pre-production QC sits next door and treats those errors at the platform level. The short version for this article: an annotation isn’t a redaction.

“The way to avoid exposure is to ensure that sensitive information is not just visually hidden or made illegible, but is actually removed from the original document.”

– National Security Agency, Redacting with Confidence, Report I333-015R-2005 (Dec. 13, 2005)
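The select-and-copy test can be turned into a simple pre-production check. The sketch below assumes the text layer of the finished PDF has already been extracted into a string by whatever PDF library you use. The function names (normalize, find_leaks) are hypothetical, and a clean result here does not rule out leaks through metadata, comments, or image content.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaks(extracted_text: str, redacted_terms: list[str]) -> list[str]:
    """Return every supposedly redacted term that still appears in the text layer."""
    haystack = normalize(extracted_text)
    leaks = []
    for term in redacted_terms:
        needle = normalize(term)
        if needle and needle in haystack:
            leaks.append(term)
    return leaks


# Text recovered from a "redacted" file whose black bar was only an overlay.
recovered = "Client: Jane\nDoe, SSN on file"
print(find_leaks(recovered, ["Jane Doe", "123-45-6789"]))  # ['Jane Doe']
```

Any non-empty result means the document was masked, not redacted, and it should not leave the building.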

"Pattern Matching to Models"

AI-Assisted Redaction at Scale

For a long time, “automatic” redaction meant pattern matching. A regex for SSNs, another for phone numbers, another for credit-card-shaped strings. That approach still has a job. It catches easy stuff at high speed, perfect for structured fields with predictable formats. Where it falls apart is everywhere else. A medical-record number doesn’t have a national format. An attorney’s name in an email signature triggers nothing. A street address split across two lines breaks any regex that assumes one line. Pattern matching catches what it was told to find and misses everything that requires reading.
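A short sketch shows both halves of that claim. The patterns below are illustrative assumptions, deliberately naive, and not drawn from any vendor's detector: the SSN pattern handles the predictable format, while the single-line address pattern misses the same address once it wraps.

```python
import re

# Predictable format: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# A naive street-address pattern that assumes the address sits on one line.
ADDRESS_PATTERN = re.compile(r"\b\d+ [A-Z][a-z]+ (?:Street|Avenue|Road)\b")


def find_matches(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


one_line = "SSN 123-45-6789, resides at 42 Maple Street."
wrapped = "SSN 123-45-6789, resides at 42 Maple\nStreet."

print(find_matches(SSN_PATTERN, wrapped))       # ['123-45-6789']
print(find_matches(ADDRESS_PATTERN, one_line))  # ['42 Maple Street']
print(find_matches(ADDRESS_PATTERN, wrapped))   # []
```

Loosening the pattern to tolerate line breaks fixes this one case and invites the next one. That is the ceiling on rule-based detection.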

That gap is what the current generation of AI-assisted tools is built to close. Most modern vendors run hybrid pipelines that mix pattern recognition with machine-learning models; what differentiates products is the review surface and the audit trail rather than which detector did the first pass. Casepoint and Foxit Smart Redact emphasize pattern coverage for SSNs, phone numbers, names, emails, and credit-card formats. Relativity’s PI Detect applies what Relativity calls “façade redactions” inside RelativityOne after its detector identifies personal information. Reveal-Brainspace adds image-layer detection, tagging driver’s licenses, passports, and similar objects inside scanned documents. Everlaw’s AI Assistant keeps source links so an attorney can confirm any flag before the deletion is applied. The common thread is that the model proposes, the attorney decides, and the system applies the deletion only after a human signs off.
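The propose-then-decide pattern can be sketched as a small data model. This is a hypothetical illustration on plain text, not any vendor's implementation: Candidate, Decision, and apply_approved are invented names, and the sketch assumes candidate spans do not overlap.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Candidate:
    start: int   # index of the first character of the flagged span
    end: int     # index one past the last character of the flagged span
    reason: str  # why the detector flagged it
    decision: Decision = Decision.PENDING


def apply_approved(text: str, candidates: list[Candidate]) -> str:
    """Delete approved spans only, and refuse to run while any flag is undecided."""
    if any(c.decision is Decision.PENDING for c in candidates):
        raise ValueError("every candidate needs an attorney decision first")
    approved = [c for c in candidates if c.decision is Decision.APPROVED]
    # Work from the end of the text so earlier offsets stay valid.
    for c in sorted(approved, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + "[REDACTED]" + text[c.end:]
    return text
```

The design choice worth noticing is the hard stop on pending candidates. The detector can flag as much as it likes, but nothing is deleted until a person has ruled on every flag.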

A public reference point for where the technology sits: OpenAI’s Privacy Filter reports an F1 score of 97.43% on a corrected version of the PII-Masking-300k benchmark, with 96.79% precision and 98.08% recall. Strong numbers, useful as a reference. They aren’t a license to skip review. “Reasonable steps to prevent disclosure” in Federal Rule of Evidence 502(b) is the legal anchor, and the only way to satisfy it is human-in-the-loop review of any AI-proposed redaction before production. That runs alongside the lawyer’s duty to supervise AI tools, and it doesn’t go away because the model’s recall is high.
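The reported figures are internally consistent, and a few lines of arithmetic show both that and why they do not remove the need for review. F1 is the harmonic mean of precision and recall. The 10,000-identifier population below is a hypothetical volume chosen for illustration, and it assumes benchmark recall carries over to a real production, which it may not.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def expected_misses(true_items: int, recall: float) -> int:
    """Identifiers a detector would be expected to miss at a given recall."""
    return round(true_items * (1 - recall))


print(f"{f1_score(0.9679, 0.9808):.4f}")  # 0.9743
print(expected_misses(10_000, 0.9808))    # 192
```

At 98.08% recall, a production containing 10,000 true identifiers would still be expected to leak roughly 192 of them without a human pass.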

“Privacy protection in modern AI systems depends on more than pattern matching. Traditional PII detection tools often rely on deterministic rules for formats like phone numbers and email addresses. They can work well for narrow cases, but they often miss more subtle personal information and struggle with context.”

– OpenAI, Introducing OpenAI Privacy Filter (2026)

"Ethics and Defensibility"

ABA Op 512, FRE 502(b), and Sedona Principle 6

There’s an ethics layer underneath all of this that small firms often skip until the bar complaint arrives, and if you’re running redactions yourself it lands on you directly. ABA Model Rule 1.6(c) requires every lawyer to make reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client. That language sits squarely on redaction practice. Every defective redaction is, by definition, an inadvertent disclosure of confidential client information. ABA Formal Opinion 477R (2017) applied Rule 1.6(c) to electronic communication and storage, naming each device, each location, and each transmission as an opportunity for disclosure that triggers the duty.

Formal Opinion 512, issued July 29, 2024, extended the same logic to generative AI. Lawyers should have a reasonable and current understanding of the specific capabilities and limitations of any generative AI tool that they wish to use. Translated to redaction work: you can’t outsource the judgment call to a model. You can use a model to find candidates and apply deletions at scale, but the supervision duty stays with you, and so does the privilege risk that travels alongside any AI-assisted workflow.

Federal Rule of Evidence 502(b) is the safety net. When a privileged document is produced inadvertently in federal proceedings, 502(b) preserves privilege if three things are true: the disclosure was inadvertent, the holder took reasonable steps to prevent it, and the holder promptly took reasonable steps to rectify the error, including following Rule 26(b)(5)(B) where applicable. The 2008 advisory committee note says the quiet part out loud: a party that uses advanced analytical software applications and linguistic tools to screen for privilege may be found to have taken “reasonable steps” to prevent inadvertent disclosure. That’s the doctrinal hook for AI redaction used defensibly.

The catch is that the 502(b) shield won’t help a process that didn’t actually screen, and it won’t help if you can’t reconstruct what the screening looked like later. Document the methodology, or you don’t have one.

“Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.”

– The Sedona Principles, Third Edition, Principle 6 (2018)

"Small-Firm Playbook"

What Defensible Redaction Looks Like in 2026

A defensible redaction process in 2026 has three structural pieces and one cultural one. The structural pieces are: true deletion at the document level, AI-assisted surfacing of candidates with a reviewing attorney making every final call, and an audit trail that records what was redacted, by whom, on what authority, and at what time. The cultural piece is harder. It’s accepting that redaction is legal work, not paralegal work, even when the volume tempts you to delegate it as formatting.
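The four audit-trail fields named above (what, by whom, on what authority, at what time) map directly onto a record type. The sketch below is a hypothetical shape, not Hintyr's schema or any other product's: RedactionEntry, record, and to_json_line are illustrative names, and the description field is assumed to hold a category label rather than the redacted content itself.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be altered after it is written
class RedactionEntry:
    document_id: str
    page: int
    description: str  # what was redacted (a category, never the content)
    reviewer: str     # by whom
    authority: str    # on what authority
    timestamp: str    # at what time, ISO 8601 in UTC


def record(document_id: str, page: int, description: str, reviewer: str,
           authority: str, now: datetime | None = None) -> RedactionEntry:
    """Create one audit entry, stamped with the current UTC time by default."""
    moment = now or datetime.now(timezone.utc)
    return RedactionEntry(document_id, page, description, reviewer,
                          authority, moment.isoformat())


def to_json_line(entry: RedactionEntry) -> str:
    """Serialize an entry as one line for an append-only log file."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = record("DOC-0001", 3, "SSN", "A. Attorney", "Fed. R. Civ. P. 5.2(a)")
print(to_json_line(entry))
```

One line per redaction, appended and never edited, is enough to reconstruct the screening later.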

The structural pieces map cleanly to the rules we’ve already walked through. True deletion answers the NSA failure modes. Human-in-the-loop AI satisfies Rule 1.6(c) and Formal Op 512. The audit trail is what 502(b) wants when you need to demonstrate “reasonable steps” later. Together they’re also the substrate for a Rule 26 proportionality argument when opposing counsel pushes back on volume.

Hintyr is built around exactly this shape. The redaction overview walks through the architecture: PDF-native true deletion with a logged audit trail per document, suggestions from AI-assisted redaction that flag PII and privileged content for attorney review, and batch redaction that applies an attorney-confirmed pattern across a population without losing per-document review records. None of that replaces your judgment. It just makes the judgment defensible at scale.

One thing this playbook isn’t: a substitute for reading the rule. Every state’s redaction format is its own statute or local rule, and the categories shift with each protective order. The playbook is the workflow. The rules are the inputs.

"Common Questions"

Frequently Asked Questions

How is redaction different from masking?

Masking covers content visually. Redaction removes it. If you can recover the original by selecting and copying, or by extracting through any PDF library, the document was masked, not redacted. In a real redaction the bytes are gone, and the rectangle that’s left can’t be reverse-engineered.

What does FRCP 5.2 require?

Rule 5.2(a) limits filings to the last four digits of an SSN or taxpayer ID, the year of birth, a minor’s initials, and the last four digits of a financial account. The advisory note assigns the duty to counsel and the filing party, not the clerk. Criminal, bankruptcy, and appellate counterparts mirror the same categories.

Does HIPAA Safe Harbor apply to litigation files?

Yes, whenever a matter touches protected health information you intend to treat as de-identified. Safe Harbor at 45 C.F.R. § 164.514(b)(2) lists eighteen categories that have to come out, with a narrow ZIP-prefix exception. Leaving any of the eighteen in place means the data isn’t de-identified.

Can I rely on AI tools to handle redaction unsupervised?

No. ABA Formal Opinion 512 and Rule 1.6(c) put a current-understanding duty on the lawyer using the tool. AI redaction near 97% F1 on public benchmarks is a strong proposal layer, not a substitute for attorney review. The 502(b) shield depends on documenting that a human signed off.

What happens if a redaction fails after production?

Federal Rule of Evidence 502(b) can preserve privilege if the disclosure was inadvertent, you took reasonable steps to prevent it, and you act promptly to rectify under Rule 26(b)(5)(B). If any leg fails, the privilege can be waived, and sanctions or fee-shifting may follow.

Is redaction the same as sealing?

No. Sealing keeps a document out of the public record by court order. Redaction removes specific content from a document filed publicly. You sometimes use both: a court orders a filing under seal, then later orders a public redacted version under Rule 5.2(d).

This article is for general informational purposes only. It does not constitute legal advice and does not create an attorney-client relationship. Statements about case law and rules reflect publicly available sources as of May 2026 and may not address your jurisdiction or matter. Consult qualified counsel before acting on any of the topics discussed.

Redact at the byte level, not the rectangle level.

Hintyr is agentic document review with PDF-native true deletion, AI-surfaced PII and privilege candidates kept under attorney review, and an audit trail built for a 502(b) hearing. Run it on a single filing or a fifty-thousand-document production.
