What Is a Hallucination in Legal AI?

Why AI cites cases that don’t exist, what the research actually measured, and the verification duty that keeps you out of a sanctions order.

Alexander Cohan, Ph.D.

Founder & CEO, Hintyr

Legal technology researcher and data scientist specializing in AI governance for litigation teams. Expertise in NLP and AI-assisted document review.

Key Takeaways

  • A hallucination is output an AI presents as fact when it has no basis in fact. In legal work that’s cases no court ever decided, quotations no judge ever wrote, and holdings that say the opposite of what the tool claims.
  • The rates aren’t small. Stanford research found general chatbots hallucinate on 58% to 88% of specific legal questions, and even specialized legal research tools built on retrieval still miss 17% to 33% of the time. Those figures reflect 2023–2024 model versions.
  • Courts almost never punish the use of AI. They punish the failure to verify, and the lack of candor when a lawyer gets caught and digs in instead of owning it.
  • Retrieval-grounded tools cut the error rate. They don’t drop it to zero. Anyone selling “hallucination-free” is selling a lower error rate as if it were a guarantee.
  • Checking your citations is your job, and you can’t delegate it. ABA Formal Opinion 512 keeps independent verification on you, whatever tool drafted the brief.
"First Principles"

What a hallucination actually is

A special master in a California federal case read a brief, found two of its cited cases persuasive, and looked them up to learn more. They didn’t exist. “That’s scary,” he wrote, because those fake authorities had nearly made it into a court order. What he hit has a name.

A hallucination is output that sounds right and is wrong. The model states the false thing in the same calm tone it uses for the true thing, and that’s what catches lawyers off guard. It happens because a large language model doesn’t look anything up. It predicts the next word from statistical patterns it learned in training, and when the data has a gap, it fills the gap with text shaped like a real answer. So a citation that doesn’t exist isn’t a glitch. It’s the system doing what it was built to do. Some researchers prefer “confabulation,” the word for gap-filling done in good faith, because the model isn’t seeing things so much as inventing to cover what it doesn’t know. Either way the lesson holds: this isn’t a rare malfunction you can patch. It’s a baseline behavior you have to check for, every time.
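To make the gap-filling concrete, here is a minimal sketch of next-word prediction, assuming a toy model that only remembers which word followed which. It is nothing like a production language model in scale or design, and the citations, party names, and the “Fict.” reporter are all invented for the example. The point is only that a system built to continue text will produce a well-formed citation whether or not the case exists.

```python
from collections import defaultdict

# Invented citations to a made-up reporter ("Fict."); none refers to a real case.
TRAINING = [
    "Smith v. Jones, 100 Fict. 200",
    "Jones v. Acme, 250 Fict. 75",
    "Acme v. Brown, 300 Fict. 410",
]

def train_bigrams(lines):
    """For each token, record every token that followed it in training."""
    following = defaultdict(list)
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for cur, nxt in zip(tokens, tokens[1:]):
            following[cur].append(nxt)
    return following

def generate(following, max_tokens=20):
    """Emit one token at a time, always choosing a continuation seen in training.

    A rotating index stands in for random sampling so the output is repeatable.
    """
    out, cur = [], "<s>"
    for step in range(max_tokens):
        options = following[cur]
        cur = options[step % len(options)]
        if cur == "</s>":
            break
        out.append(cur)
    return " ".join(out)

model = train_bigrams(TRAINING)
citation = generate(model)
print(citation)              # Smith v. Brown, 300 Fict. 410
print(citation in TRAINING)  # False
```

Every step followed a pattern the model had seen, and the result is still a citation that appears nowhere in its data. Nothing in the loop ever asks whether the case is real.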

Courts have started defining the term themselves. The most-cited judicial definition comes from Judge Newsom’s concurrence in Snell v. United Specialty Insurance Co.: a generative AI program, in his words, “‘hallucinates’ when, in response to a user’s query, it generates facts that, well, just aren’t true, or at least not quite true.” Other courts have echoed it. The ethics regulators landed in the same place. ABA Formal Opinion 512 warns that “some GAI tools are also prone to ‘hallucinations,’ providing ostensibly plausible responses that have no basis in fact or reality.”

In legal work the fabrication takes three shapes. Phantom cases: citations to decisions that were never written, sometimes pinned on real judges. Invented quotations: passages in quotation marks attributed to an opinion that says no such thing. And misstated holdings: the case is real, the citation checks out, but it doesn’t stand for the point you’re citing it for. That last one is the sneaky one. It survives a quick existence check. You can pull up the case, confirm it’s real, and still be wrong about what it says, which is why “does this case exist” is the floor of verification and never the ceiling.

"The Numbers"

How often legal AI gets it wrong

You don’t have to take anyone’s word for how often legal AI invents things. Researchers at Stanford’s RegLab and its Institute for Human-Centered AI ran the experiment twice, against the model versions available in 2023 and 2024, so read every percentage below as a snapshot of those generations, not a permanent ceiling.

The first study put general-purpose chatbots on the stand, asking the major consumer models specific, verifiable questions about random federal cases and checking the answers against the record. The verdict was blunt: legal hallucinations were “alarmingly prevalent, occurring between 58% of the time with ChatGPT 4 and 88% with Llama 2.” On detailed questions about real cases, a leading model was wrong most of the time, and the weakest was wrong almost always. That’s why nobody serious tells you to paste a research question into a raw consumer chatbot.

The harder question is what happens with a tool built for law, one wired into Westlaw or Lexis that promises grounded, cited answers. So the same group ran a second study, hand-scoring 202 queries against the May 2024 versions of the leading platforms. Lexis+ AI hallucinated on about 17% of queries. Ask Practical Law AI landed near 17% too. Westlaw AI-Assisted Research came in highest, around 33%. A bare GPT-4 baseline sat at 43%. Stanford’s headline put it plainly: “AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries.” The specialized tools clearly beat the raw chatbot. They also clearly didn’t reach zero.

The definition the researchers used matters as much as the count. They treated a response as hallucinated if it was “either incorrect or misgrounded,” and that second category is the one that should keep you up at night. A misgrounded answer states the law correctly but cites a source that “does not in fact support its claims.” As Stanford’s write-up of the study put it, that kind of error “may be even more pernicious than the outright invention of legal cases,” because a made-up case name fails the first check you’d run, while a real-looking citation that doesn’t support your point sails past a quick look and into your brief. It’s the same trap you watch for when you’re spotting a fabricated citation in your opponent’s brief.
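The “either incorrect or misgrounded” rule is easy to state as code. What follows is a simplified sketch, not the researchers’ scoring procedure: it assumes each response has already been hand-labeled on two yes/no questions, and the class name, field names, and six sample labels are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    correct: bool    # does the answer state the law accurately?
    grounded: bool   # does the cited source actually support the answer?

def is_hallucinated(r: ScoredResponse) -> bool:
    """Hallucinated if incorrect OR misgrounded, per the definition quoted above."""
    return (not r.correct) or (not r.grounded)

def hallucination_rate(responses) -> float:
    """Share of responses that fail either test."""
    return sum(is_hallucinated(r) for r in responses) / len(responses)

# Made-up labels for six responses; this is not data from the study.
sample = [
    ScoredResponse(correct=True, grounded=True),
    ScoredResponse(correct=True, grounded=True),
    ScoredResponse(correct=True, grounded=False),   # right on the law, wrong source
    ScoredResponse(correct=False, grounded=True),
    ScoredResponse(correct=True, grounded=True),
    ScoredResponse(correct=True, grounded=True),
]
print(f"{hallucination_rate(sample):.0%}")  # 33%
```

Score the same six responses on correctness alone and only one of them fails. The misgrounded answer is the one a correctness-only check lets through.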

One fairness note. Thomson Reuters and LexisNexis dispute counting “misgrounding” as a hallucination at all, and Thomson Reuters argued Ask Practical Law AI is built for secondary sources, not primary-law research. Stanford answered the methodology point by retesting Westlaw AI-Assisted Research on its own terms, and it still hallucinated about 33% of the time. Keep the dates in view, too. Later work, including the Vals AI Legal Research Report from October 2025, shows general AI with web search starting to narrow the gap. The numbers move. The lesson about verification doesn’t.

"The Sanctions"

Why lawyers keep getting sanctioned

Read enough of these orders and one lesson surfaces. The sanction is almost never for using AI. It’s for failing to verify the output, and then, when caught, failing to come clean. Damien Charlotin’s database has logged more than 1,500 of these decisions worldwide, as of May 2026, and the count climbs every week.

It started with Mata v. Avianca. ChatGPT handed a lawyer six decisions that didn’t exist, with fake docket numbers, bogus quotes, and real judges named as authors of opinions they never wrote. Judge Castel imposed a $5,000 penalty under Rule 11, and was explicit that the tool wasn’t the problem: “there is nothing inherently improper about using a reliable artificial intelligence tool for assistance. But existing rules impose a gatekeeping role on attorneys to ensure the accuracy of their filings.” What drew the sanction was “conscious avoidance and false and misleading statements to the Court.” The lawyer doubled down when he should have folded.

The Second Circuit was just as plain in Park v. Kim, referring an attorney to its Grievance Panel after she cited a case she admitted she’d generated with ChatGPT. Rule 11, the court said, requires “that attorneys read, and thereby confirm the existence and validity of, the legal authorities on which they rely.”

Then the cases that should worry anyone who assumed enterprise tools were safe. In Wadsworth v. Walmart, lawyers filed a motion citing nine cases, eight of them nonexistent, from their firm’s own purpose-built tool. Judge Rankin revoked one attorney’s pro hac vice admission and fined him $3,000, fined two others $1,000 each, and held that “a finding of subjective bad faith is not required to impose sanctions.” A supervising partner was sanctioned on the strength of his signature alone. A thousand-lawyer firm with a custom tool still got burned, because the residual hallucination rate doesn’t care what your tooling cost.

It scales from there. In Lacey v. State Farm, a special master sanctioned two firms, one of them among the largest in the country, $31,100 jointly after roughly nine of 27 citations in a ten-page brief came back wrong, two of them pointing to cases that don’t exist. In Coomer v. Lindell, the defamation case against MyPillow founder Mike Lindell, two attorneys each drew a $3,000 fine for a brief with nearly 30 defective citations, and one was later sanctioned again, then a third time. The escalation tracks the conduct, not the technology. The stakes can also move past money: in an Alabama matter, three attorneys were reprimanded, removed from the case, and referred to the state bar for fabricated citations the court called “recklessness in the extreme.” The malpractice exposure that follows an unchecked filing is its own subject.

And then the case that proves the rule by breaking it. In United States v. Cohen, Michael Cohen used Google Bard, which he thought was a “super-charged search engine,” found three nonexistent cases, and passed them to his attorney, who filed them unchecked. No sanctions. Judge Furman called the citations “embarrassing and certainly negligent” but found no bad faith. Same fabrication, different outcome, because of candor. Verify, and a mistake stays a mistake. Conceal, and it becomes the thing you get punished for.

"Retrieval"

What retrieval fixes, and what it doesn’t

So why did the legal-specific tools beat the raw chatbot by such a wide margin? Because retrieval grounds an answer in real documents: the system pulls relevant material first, then writes its answer from that retrieved text. Anchoring the model to real sources cuts the guesswork that produces fake cases, which is why Lexis+ AI and Westlaw landed in the teens and thirties instead of GPT-4’s 43%. Grounding works. It just isn’t a cure.

Here’s the part the marketing skipped. Retrieval narrows the model’s focus and hands you a citation to click, but it isn’t a fact-checker. Nothing in the pipeline proves the retrieved passage is on point, or that the model read it right. That leaves three failure modes. Retrieval failure is the first: pull the wrong documents, or an incomplete slice of the right ones, and the answer inherits every flaw in that set. Misgrounding is the second, the same problem from the numbers above: a real, on-point-looking authority that doesn’t support the claim and survives a glance. The third you can watch for on the page: Stanford found Westlaw’s higher error rate tracked with longer answers, around 350 words against Lexis’s 219. More words mean more claims that can be wrong.

There’s a related risk, too: a model that agrees with a premise you got wrong. It showed up most in the general-chatbot study, where leading models often failed to correct a user’s mistaken legal assumption. One more reason the raw chatbot is the riskier place to start.

None of this is limited to legal research. The same misgrounding turns up in document review and e-discovery, when an AI tool summarizes a production or drafts from the record and points you to a page that doesn’t say what the tool claims. Wherever the answer points at a source, you still have to open the source.
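The gap between retrieving a passage and proving it supports a claim shows up even in a toy retriever. This sketch assumes the crudest possible search, ranking passages by how many words they share with the query. Real platforms use far more sophisticated ranking, and the three passages and their IDs are invented for the example. The weakness it illustrates, relevance standing in for support, is the one described above.

```python
import re

# Hypothetical mini-corpus; the passages are invented for illustration.
CORPUS = {
    "doc-A": "The court held the contract claim was barred by the statute of limitations.",
    "doc-B": "The court held the contract claim was not barred and could proceed.",
    "doc-C": "The hearing on the motion was rescheduled to Tuesday.",
}

def tokens(text):
    """Lowercased set of words in a passage."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, corpus, k=1):
    """Rank passages by words shared with the query (a stand-in for real search)."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda doc_id: len(q & tokens(corpus[doc_id])), reverse=True)
    return ranked[:k]

claim = "The contract claim was not barred."
top = retrieve(claim, CORPUS, k=2)
print(top)  # ['doc-B', 'doc-A']
```

Both results look on point. doc-A shares five of the claim’s six words and holds the opposite. A ranking score measures similarity, not whether the passage supports what the answer says, which is why the click-through still has to end with someone reading the source.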

Which brings us back to the marketing. Several providers told the market their tools were “eliminating” hallucinations or were “hallucination-free.” Stanford tested those exact promises and called them “overstated.” That’s the honest frame. Retrieval buys you real gains in accuracy and a citation trail you can follow. What it doesn’t buy is permission to file without checking. The residual error rate is lower, but it isn’t zero, and a single fabricated cite is enough to draw a Rule 11 sanction.

"The Duty"

Your verification duty under ABA Opinion 512

None of this is a new rule, and that’s the point most coverage misses. Generative AI didn’t create a duty to verify your citations. You already had one, and it doesn’t bend because the bad cite came from a machine instead of a junior associate.

ABA Formal Opinion 512, the bar’s first formal guidance on generative AI, anchors the duty in competence. The opinion is direct: relying on or submitting a GAI tool’s output “without an appropriate degree of independent verification or review of its output” could “violate the duty to provide competent representation as required by Model Rule 1.1.” It scales the review to the task, and folds in candor under Rule 3.3, requiring you, before you file, “to review these outputs, including analysis and citations to authority, and to correct errors.”

The rest of the rules are the ones you learned for the bar. Rule 1.1 Comment 8 already told you to keep abreast of “the benefits and risks associated with relevant technology.” Rules 5.1 and 5.3 make managing and supervising lawyers responsible for the work an AI tool produces, which now means a real AI policy and training. And Rule 1.6 is why you should think twice before you feed client facts into a public chatbot: self-learning tools “raise the risk that information relating to one client’s representation may be disclosed improperly.”

The bench isn’t waiting on the bar. Judge Brantley Starr in the Northern District of Texas issued the first federal standing order back in May 2023, and it doesn’t mince words: “These platforms in their current states are prone to hallucinations and bias. On hallucinations, they make stuff up, even quotes and citations.” A growing number of judges have followed, so you have to know when a court actually requires you to disclose AI use before you file. The orders haven’t stopped the problem. By late 2025, the scholar tracking these cases was logging five or six a day, and most firms still don’t have a written AI policy of their own.

So do the work. A defensible workflow, drawn from the standing orders and Opinion 512, runs six steps: use AI for first-pass research only; confirm every citation in a real database, that the case exists, the cite is right, and it stands for your point; read the full opinion; confirm it’s still good law; log who verified it and when; and check the assigned judge’s standing order before filing.
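One way to make that workflow auditable is a per-citation record that won’t show as complete until every check is logged. This is a minimal sketch under our own assumptions, not a form any court or the ABA prescribes. The check names, class, and function are hypothetical, and the citation in the usage lines is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Per-citation checks drawn from the workflow above; the names are hypothetical.
CHECKS = ("exists", "cite_correct", "supports_point", "full_opinion_read", "still_good_law")

@dataclass
class CitationRecord:
    citation: str
    verified_by: str = ""
    verified_on: Optional[date] = None
    passed: set = field(default_factory=set)

    def mark(self, check: str, who: str, when: date) -> None:
        """Log one completed check, with the person and date of the latest check."""
        if check not in CHECKS:
            raise ValueError(f"unknown check: {check}")
        self.passed.add(check)
        self.verified_by, self.verified_on = who, when

    def outstanding(self) -> list:
        """Checks still to be done, in workflow order."""
        return [c for c in CHECKS if c not in self.passed]

def ready_to_file(records, standing_order_checked: bool) -> bool:
    """True only if every citation is fully checked and the judge's order was reviewed."""
    return bool(standing_order_checked) and all(not r.outstanding() for r in records)

rec = CitationRecord("Example v. Placeholder (hypothetical)")
rec.mark("exists", "A. Associate", date(2026, 5, 1))
rec.mark("cite_correct", "A. Associate", date(2026, 5, 1))
print(rec.outstanding())  # ['supports_point', 'full_opinion_read', 'still_good_law']
print(ready_to_file([rec], standing_order_checked=True))  # False
```

The record above passes the existence check and still isn’t ready to file, which is the point: existence is the first box, not the last.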

No tool gets you out of that, including ours, and you should be wary of one that says it does. This is the problem we built Hintyr to narrow. Hintyr is Agentic Document Review for small and mid-size firms, and the design choice we’ll defend is a narrow one: every answer is grounded in your own case documents, and every answer links back to the exact page it came from. That doesn’t make verification optional. It makes it fast, a click instead of a hunt. You’re still the lawyer.

California may make the duty statutory. SB 574 passed the state Senate 39 to 0 in January 2026 and is pending in the Assembly; it would require attorneys to verify AI-generated material, correct hallucinated output, and personally read and verify every citation in a filing. Either way, the instruction hasn’t changed. The tool can draft. You verify. Your signature says you did.

"Candor"

What courts actually punish

So here’s the lesson three years of sanctions keep teaching, and it has almost nothing to do with which tool you picked. The durable fix isn’t a vendor’s promise. It’s workflow and supervision: a person who reads the cases, a signature that means something, and a record showing the work got done.

Look at who walked away light. Cohen owned his mistake, the court found no bad faith, and no sanction followed. One firm pulled its motion with eight fake cases within a day, paid the other side, and trained its people; its partners drew fines in the low thousands. Now look at who didn’t. The lawyers in Mata doubled down and asked the chatbot to confirm its own fabrications. The attorney sanctioned three times in the Coomer matter kept defending the error instead of fixing it. The penalties climbed every time.

The technology will keep improving. The question a court asks when something slips through won’t change: did you check, and did you tell the truth when it mattered?

"Common Questions"

Frequently asked questions

What is a hallucination in legal AI?

It’s when an AI states something as fact that has no basis in fact. In law that means fabricated cases, invented quotes, and misstated holdings. The model isn’t looking anything up. It’s predicting plausible text, so when its training has a gap, it fills the gap with something that reads right and isn’t.

Is there such a thing as hallucination-free legal AI?

No, and you should be wary of anyone who says otherwise. Vendors have made that claim, and Stanford researchers tested it and called it overstated. Retrieval-grounded tools lower the error rate, but they don’t erase it. Treat the published 17% to 33% range as a snapshot of the 2024 tools, not a promise about the one on your desk, and verify every citation yourself.

Can AI really cite fake cases that don’t exist?

Yes, routinely. ChatGPT invented six fake decisions in the Avianca matter, with fake docket numbers and real judges listed as authors of opinions they never wrote. Damien Charlotin’s tracker has logged more than 1,500 court decisions worldwide involving AI-fabricated content, and the pace is climbing.

What happens to lawyers who file AI-hallucinated citations?

It depends mostly on what you do next. Courts have imposed Rule 11 fines, fee-shifting, pro hac vice revocation, and bar referrals. But the lawyers who owned the error fast often avoided the worst, while the ones who concealed it or doubled down drew escalating penalties.

This article is for general informational purposes only. It does not constitute legal advice and does not create an attorney-client relationship. Descriptions of court orders, the ABA Model Rules, ABA Formal Opinion 512, and the studies cited reflect publicly available sources as of May 2026 and may not address your jurisdiction or matter. Verify every authority against a primary source, and consult qualified counsel before relying on any of the discussion above.
