What Is RAG? Retrieval-Augmented Generation for Lawyers

You have heard RAG three times this month. Here is what it actually does for a law firm, where it earns the marketing, and where the Stanford data says it still misses.

Alexander Cohan, Ph.D.

Legal technology researcher and data scientist specializing in AI governance for litigation teams. Expertise in NLP and AI-assisted document review.

Key Takeaways

  • RAG was named in a 2020 Facebook AI paper. It pairs a retriever with a language model so answers cite real sources.
  • Stanford’s 2024 audit found leading legal RAG tools still hallucinated 17 to 33 percent of the time on the versions tested.
  • In a 2025 randomized trial, RAG-aided law students hallucinated about as often as no-AI peers and finished 14 to 37 percent faster.
  • ABA Formal Opinion 512 keeps the verification duty on you. Citations still need a click before they go in a brief.
  • More than 1,350 AI-citation cases are now logged worldwide. About 90 percent of the U.S. firms involved are solo or small practices.
"The plain definition"

What Is RAG Legal AI? A Definition for Lawyers

You already do RAG every time you Shepardize a citation by hand: you look up a source, then write something based on what it says. The acronym dresses up an old habit. A retriever fetches the passages most relevant to your question. A language model reads them. The model writes an answer that points back to the passages it just read.

That is what most lawyers really mean when they ask what is RAG legal AI. Retrieval-augmented generation, or RAG, is an AI architecture that grounds a language model in your own documents before it answers. The system searches a curated source set, hands the relevant passages to the model, and asks it to write only from what was retrieved. Citations point back to specific pages.
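
To make that shape concrete, here is a minimal sketch of the loop in Python. Every name in it is a placeholder rather than any vendor’s API: embed, vector_index, and llm_answer stand in for whatever embedding model, vector database, and language model a real product wires together.

```python
# Minimal retrieve-then-generate loop. All three dependencies (embed,
# vector_index, llm_answer) are hypothetical stand-ins, not a real API.

def rag_answer(question, embed, vector_index, llm_answer, k=5):
    # Step 1 -- retrieve: find the k passages closest in meaning to the question.
    passages = vector_index.search(embed(question), top_k=k)

    # Step 2 -- generate: hand the model only those passages and instruct it
    # to write from them, citing the bracketed tag for every assertion.
    context = "\n\n".join(f"[{p.doc_id}, p. {p.page}] {p.text}" for p in passages)
    prompt = (
        "Answer using ONLY the passages below, citing the bracketed tag "
        "for each assertion. If the passages do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

    # Step 3 -- return the draft alongside the pages a reviewer can click open.
    return llm_answer(prompt), [(p.doc_id, p.page) for p in passages]
```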

The recipe was named in 2020 by Patrick Lewis and a team of co-authors at Facebook AI Research, in a paper titled Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. They reported that RAG models “hallucinate less and generate factually correct text more often than BART,” the bare model they used as a baseline. Five years later, every legal AI vendor that matters runs some version of the same idea.

The marketing has gotten ahead of the architecture. Pablo Arredondo, who co-founded Casetext before Thomson Reuters bought it, put the practitioner version more plainly on a podcast: you “anchor the system in a search engine that will retrieve real results and then force GPT-4 to answer based on what it’s seeing in front of it, the real case law, not freestyling an answer.”

The distinction matters. A bare LLM is recall. It generates text from patterns it absorbed during training. RAG is research, then writing. The model still phrases the answer, but the substance comes from documents you control. For a litigator, that shift isn’t cosmetic: it is the difference between an answer you take on faith and one you can check against a page. That’s RAG. The rest is engineering.

"Retrieve, then generate"

How Retrieval-Augmented Generation Works in Three Steps

Word search is fast and dumb. It finds “loss of consortium” but misses “spousal companionship damages.” That gap is what the retrieval half of RAG is built to fix.

Retrieval is the part of RAG that decides which paragraphs the model reads before it writes. Two ideas matter here: vector embeddings and vector search.

A vector embedding is a list of numbers that acts as a numerical address for the meaning of a sentence. Sentences with similar meanings sit at addresses close together. Sentences with different meanings sit far apart. Pinecone’s framing is that “the semantic similarity of these objects and concepts can be quantified by how close they are to each other as points in vector spaces.” A vector database is the index holding those addresses, answering “which paragraphs are nearest to this query?” in milliseconds.
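
A toy example makes the “numerical address” idea concrete. The four-dimensional vectors below are invented for illustration; real embedding models emit hundreds or thousands of dimensions, but the distance arithmetic is identical.

```python
import numpy as np

# Invented toy vectors -- real embeddings have hundreds of dimensions.
sentences = {
    "loss of consortium":            np.array([0.90, 0.80, 0.10, 0.05]),
    "spousal companionship damages": np.array([0.85, 0.82, 0.12, 0.08]),
    "motion to compel production":   np.array([0.05, 0.10, 0.90, 0.80]),
}

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning); near 0 = unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = sentences["loss of consortium"]
for text, vec in sentences.items():
    print(f"{text:32s} {cosine_similarity(query, vec):.3f}")

# The paraphrase scores near 1.0; the unrelated phrase scores far lower.
# A vector database runs this comparison against millions of stored
# paragraphs and returns the nearest ones in milliseconds.
```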

For a lawyer, meaning beats keywords. Asked for “communications suggesting price coordination,” a keyword search for “price-fixing” misses the email reading “let’s both hold the line at $42 next quarter.” A dense retriever finds it. A 2020 paper from Karpukhin and co-authors at Facebook AI showed dense retrieval beats classic keyword scoring by 9 to 19 percentage points in top-20 passage retrieval accuracy, because “synonyms or paraphrases that consist of completely different tokens may still be mapped to vectors close to each other.”

Pure dense retrieval isn’t enough for legal text on its own. Statutes, case captions, and Bates numbers are exact strings; embeddings smear them. Hybrid search is the fix: run a keyword scorer alongside dense retrieval and merge the ranked lists. Hybrid catches Smith v. Jones, 412 F. Supp. 3d 198, and the paraphrased holding three pages later.
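
One common way to merge the two ranked lists is reciprocal rank fusion. The sketch below is illustrative, with invented document IDs; the formula itself is the published one, with the customary constant k = 60.

```python
# Reciprocal rank fusion: score each document by summing 1 / (k + rank)
# across every ranker that returned it. Documents found by BOTH the
# keyword scorer and the dense retriever float to the top.

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Invented results: keyword search nails the exact citation string,
# dense search finds the paraphrased holding.
keyword_hits = ["smith_v_jones_412_fsupp3d_198", "bates_000412", "pricing_memo"]
dense_hits   = ["hold_the_line_email", "pricing_memo", "smith_v_jones_412_fsupp3d_198"]

print(reciprocal_rank_fusion([keyword_hits, dense_hits]))
# The Smith cite and the pricing memo rank highest: both rankers agree on them.
```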

Two engineering tricks make this fast and accurate. Specialized graph indexes scale similarity search to billions of passages at sub-second latency. Then a reranker rescores the top results with a slower model that reads the query and each passage together. Fast first, slow second, careful by the end. That two-stage shape is how an AI agent reads your case file in a serious tool.
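
In code, the two-stage shape is short. Both dependencies below are hypothetical stand-ins rather than a specific library: ann_index plays the fast approximate index, and cross_encoder_score the slower reranking model.

```python
# Two-stage retrieval: a fast approximate pass over the whole corpus,
# then a slow, careful rescoring of the survivors. ann_index and
# cross_encoder_score are placeholders, not a specific library's API.

def two_stage_retrieve(query, embed, ann_index, cross_encoder_score,
                       shortlist_size=200, final_k=10):
    # Stage 1 -- fast and approximate: vector similarity narrows millions
    # of passages to a few hundred candidates in milliseconds.
    candidates = ann_index.search(embed(query), top_k=shortlist_size)

    # Stage 2 -- slow and careful: the reranker scores each (query, passage)
    # pair jointly. Too expensive corpus-wide, affordable on 200 candidates.
    candidates.sort(key=lambda p: cross_encoder_score(query, p.text), reverse=True)
    return candidates[:final_k]
```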

"The grounding gap"

RAG vs LLM: Why Grounding Beats a Bare Chatbot

Before RAG, asking ChatGPT a legal question was an exercise in faith. The model spoke fluently about cases that didn’t exist. In 2023, that habit produced Mata v. Avianca, the now-famous sanction against Steven Schwartz after he filed a brief built on ChatGPT’s Varghese v. China Southern Airlines, which wasn’t a real opinion. Judge Castel ordered Schwartz, his colleague, and the firm to pay $5,000 jointly, and the case became the cultural memory every legal AI conversation now starts from. RAG wouldn’t have prevented Schwartz from being lazy. It would have changed what laziness looked like, because the model would have been forced to pull from a real database before it answered.

The empirical point is sharper. A 2024 study in the Journal of Legal Analysis by Matthew Dahl and co-authors at Stanford and Yale measured how often general-purpose LLMs invented things in response to specific federal-case queries. The hallucination rate ran “between 58% of the time with ChatGPT 4 and 88% with Llama 2,” and rose to at least 75 percent on direct questions about a court’s holding. Lewis’s original RAG paper compared its grounded outputs to BART, the bare baseline, and found human evaluators judged BART more factual than RAG in only 7.1 percent of cases, with RAG more factual in 42.7 percent. The lift is real. AI malpractice risk and Rule 11 still apply, of course; nothing here changes the fact that you sign the brief.

What RAG actually improves comes down to three things. The model points to a source you can read. That source can be updated without retraining the model. And two reviewers can audit a citation chain instead of arguing about what the model “knew.” None of that gets the answer right by itself. It gets the answer to a place where you can check it.
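
That audit trail reduces to a data structure. Here is one sketch, with field names invented for illustration: every assertion carries its own pointers back into the corpus, so reviewers check records instead of arguing about the model’s memory.

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    doc_id: str   # which document in the case file
    page: int     # the page a reviewer clicks open
    quote: str    # the passage the assertion rests on

@dataclass
class GroundedAssertion:
    claim: str
    citations: list[Citation] = field(default_factory=list)

    def is_auditable(self) -> bool:
        # No citation, no audit trail: flag the assertion rather than
        # debating what the model "knew".
        return bool(self.citations)
```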

"The hallucination ceiling"

Does RAG Solve AI Hallucination? The 17 to 33 Percent Reality

RAG narrows the failure modes. It doesn’t close them.

The strongest evidence is Stanford’s 2024 audit by Varun Magesh, Daniel Ho, and co-authors at the RegLab and HAI. They tested Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI as those products existed in early 2024 and measured hallucination rates between 17 and 33 percent. Lexis+ AI was the most accurate, answering 65 percent of queries correctly; Westlaw came in at 41 percent, and Practical Law at 19 percent. Vendors have shipped updates since, and no peer-reviewed re-audit has been published as of April 2026. Even so, the original numbers reset what “hallucination-free” can credibly mean.

The harder finding is what kind of error RAG produces. Misgrounded citations were the dominant failure mode: the cited authority is real and clickable, but the proposition the AI attributes to it isn’t actually in the case. The Stanford team wrote that those errors “are potentially more dangerous than fabricating a case outright, because they are subtler and more difficult to spot.” A fake case fails the most basic Shepardization. A misgrounded one fails only when you read the cite. That’s the kind of error that makes it into a brief.
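
A naive verification sketch shows why this failure mode is the hard one. Assuming a hypothetical case_database that maps citations to opinion text, the two mechanical checks below catch a Mata-style fabrication, and the misgrounded cite sails through both.

```python
def first_pass_check(citation, quoted_text, case_database):
    # Check 1: does the cited case exist at all? (Catches a Varghese.)
    opinion = case_database.get(citation)
    if opinion is None:
        return "FAIL: cited case does not exist"

    # Check 2: does the quoted language actually appear in the opinion?
    if quoted_text and quoted_text not in opinion:
        return "FAIL: quoted language not found in the opinion"

    # The misgrounded citation passes both: the case is real and the quote
    # is real, but the proposition attributed to it may not be. Only
    # reading the cite catches that -- which is why it reaches briefs.
    return "PASS mechanically -- now read the case"
```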

That’s the architecture’s ceiling, and it sits squarely on top of ABA Formal Opinion 512. The opinion holds that “a lawyer’s reliance on, or submission of, a GAI tool’s output, without an appropriate degree of independent verification or review of its output, could violate the duty to provide competent representation as required by Model Rule 1.1.” The cases line up too. Steven Schwartz, his colleague Peter LoDuca, and the firm of Levidow, Levidow & Oberman were jointly sanctioned $5,000 in Mata. Zachariah Crabill was suspended for one year and one day in Colorado in November 2023 after citing AI-fabricated cases and falsely blaming the error on an intern.

So what are you paying for? Fewer pages to read, not zero pages. A grounded answer points to specific pages of specific documents in your own file. The verification step Rule 11 already requires becomes a click instead of a research project, and that’s the entire economic case.

"Where it earns trust"

Where RAG Helps Small Firms Today (and Where It Stumbles)

If you only use RAG for one thing, use it where the documents already exist and the question is bounded. The wins so far cluster in three places.

Case-law search and authority lookup is the most-tested slot. Westlaw says its AI-Assisted Research uses RAG to keep the language model focused on the actual text of Westlaw content; Lexis+ AI markets linked citations on every answer. vLex’s Vincent AI is the RAG-aided tool from the only randomized controlled trial comparing these systems against a non-AI baseline. In Daniel Schwarcz’s 2025 study, 127 upper-level law students worked through six tasks under three conditions: no AI, OpenAI’s o1-preview reasoning model, or Vincent. The Vincent group finished 14 to 37 percent faster, posted productivity gains of 38 to 115 percent on five of the six tasks tested, and produced three hallucinations across the whole experiment. The bare-reasoning group produced eleven. Generalize that across “your firm” at your own risk; it’s one academic study with upper-level law students, not practicing lawyers. The honest read is that grounded retrieval cuts error rates and points at sources you can check.

Contract review and drafting works because the corpus is bounded. The firm playbook plus the document under review fits in a single retrieval index. Harvey, which served more than 100,000 lawyers across 1,300 organizations as of late 2025, leans on hybrid retrieval because “purely dense embeddings might struggle with rare terms such as case identifiers and named entities.” That’s the contract lawyer’s problem in one sentence: a thousand redlines look the same to a vector unless the index also remembers the captions and the defined terms. For clause-level redlines and the cost math behind them, see the GenAI vs TAR cost comparison.

Document review and privilege log generation is the last slot, and it sits inside the e-discovery workflow small firms use most often. Relativity’s aiR for Privilege drafts privilege log entries by identifying the people whose participation in a communication confers privilege. Privilege logs are document-by-document drafting, the anchored task RAG handles best because every assertion can be cited to a page. For pure responsiveness review, technology-assisted review remains the defensible path.

Hintyr is built around this exact pattern. It is an Agentic Document Review platform for small and mid-size firms, and every answer it produces is scoped to your own case files, anchored to the page you can click open, and refused outright when retrieval comes back empty. You see the citation before you see the conclusion. That is the point of the product, and it is the source of the line we keep coming back to: always intuitive, always accurate, always cited.

"Before you sign"

What to Ask a Vendor Before You Trust Their RAG

By the time you’re reading vendor pitches, the question isn’t “does this use RAG” but “how does this use RAG, and how do you know it works.” Four questions cover most of the ground.

  1. Show me a citation that links back to the source passage. Not a footnote. Not a list of “sources consulted.” A live link to the page where the cited proposition actually appears. If the demo cannot do this, the citation is decorative.
  2. Tell me your evaluation set and your error rate. Magesh and co-authors wrote that “until vendors provide hard evidence of reliability, claims of hallucination-free legal AI systems will remain, at best, ungrounded.” A serious tool has tested itself against a holdout set the vendor did not write. Ask for the methodology; insist on a number. Vendors that cannot describe their evaluation often cannot describe their data handling either, which raises the same privilege exposure in consumer AI that unvetted chatbots already raised.
  3. What happens when the answer isn’t in the corpus. The Vals AI Legal AI Report praised Vincent specifically for refusing to answer when the underlying database lacked the source. That refusal is a feature, not a bug. The wrong answer to “what should the tool do when it doesn’t know” is “make something up that sounds right.” A minimal sketch of that refusal logic follows this list.
  4. Who supervises the model when it’s wrong. ABA Op. 512 says “lawyers must have a reasonable understanding of the capabilities and limitations of the specific GAI technology that the lawyer might use.” A vendor that can’t describe how it surfaces error to your supervising attorney is a vendor that hasn’t finished building the product.
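
For what questions 1 and 3 look like in practice, here is a hedged sketch. retrieve and generate are hypothetical placeholders, and the relevance floor of 0.5 is an invented threshold that real systems tune empirically.

```python
# Sketch of the two behaviors to demand in a demo: citations that resolve
# to the source passage, and refusal on empty retrieval. retrieve() and
# generate() are hypothetical stand-ins, not a vendor's API.

RELEVANCE_FLOOR = 0.5  # invented cutoff; real systems tune this

def answer_or_refuse(question, retrieve, generate):
    passages = [p for p in retrieve(question) if p.score >= RELEVANCE_FLOOR]

    if not passages:
        # Question 3: the right behavior when the corpus lacks the answer
        # is a refusal, not fluent improvisation.
        return {"answer": None, "reason": "No relevant source in the corpus."}

    return {
        "answer": generate(question, passages),
        # Question 1: each citation resolves to a document and page,
        # not a decorative list of "sources consulted".
        "citations": [{"doc": p.doc_id, "page": p.page} for p in passages],
    }
```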

This post is for informational purposes only and is not legal advice. RAG capabilities, vendor claims, and ethics guidance vary by jurisdiction and tool. Confirm your state bar’s current AI guidance and verify any AI-assisted citation against the underlying authority before relying on it in client work.

See What Cited Document Review Looks Like

Hintyr is Agentic Document Review for small and mid-size firms. Every answer is scoped to your case files and anchored to the exact page, so cited output is the default.