TwinLadder Weekly

Issue #2 | February 2025


Stanford's Hallucination Study: What 17% Error Rate Really Means for Your Practice

The best legal AI tools still fabricate citations on roughly one query in six. Here's how to protect yourself.


If you've been following legal AI news, you've probably heard vendors claim their tools are "hallucination-free" or have "near-perfect accuracy." Stanford's researchers decided to test those claims.

The results should change how you think about every AI-assisted research task.

The Study That Changed the Conversation

Stanford RegLab published "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" in the Journal of Empirical Legal Studies—the first preregistered empirical evaluation of AI-driven legal research tools.

The methodology was straightforward: create realistic legal research queries, run them through the leading AI tools, then manually verify every response and citation.

That last part—manual verification—turned out to be "extraordinarily time consuming," the researchers noted. Each response contained multiple citations, and each citation required independent verification.

What they found should give every practitioner pause.

The Numbers

Tool                          Hallucination Rate  Accurate Responses
Lexis+ AI                     17%+                65%
Westlaw AI-Assisted Research  34%+                Not reported
Ask Practical Law AI          17%+                18%
GPT-4 (no RAG)                69%+                Baseline comparison

Let's be clear about what these numbers mean:

Lexis+ AI, the best-performing commercial tool, hallucinates on roughly one in six queries. That's not a rounding error. That's a meaningful failure rate for a tool used in professional legal work.

Westlaw's AI hallucinates at nearly double that rate—one in three queries produced incorrect or misgrounded information.

GPT-4 without legal RAG hallucinated on more than two-thirds of queries—which is why you shouldn't use general-purpose AI for legal research without extensive verification.

Two Types of Hallucinations

The Stanford team identified two distinct failure modes:

1. Incorrect Responses. The AI describes the law incorrectly or makes factual errors. It might tell you a statute says something it doesn't, or mischaracterize a holding.

2. Misgrounded Responses. The AI describes the law correctly but cites sources that don't actually support the claim. This is particularly insidious—the answer sounds right, the citation looks legitimate, but the source doesn't say what the AI claims.

Both types can end your career if they make it into a filing.

Why This Matters More Than You Think

Consider what a 17% hallucination rate means in practice:

  • 5 research queries per day = ~1 hallucinated response daily
  • 25 queries per week = 4-5 potential errors weekly
  • 100 queries per month = 17 fabrications you need to catch

Now multiply that across a firm. A 20-lawyer shop doing moderate AI-assisted research could be generating dozens of hallucinations weekly—any one of which could become a sanctions motion, malpractice claim, or bar complaint.
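
To sanity-check these figures yourself, here is a minimal Python sketch of the same expected-value arithmetic, plus the compounding chance of seeing at least one hallucination in a batch of queries. The 17% rate comes from the study; the query volumes are illustrative assumptions, and treating queries as independent is a simplification.

# Back-of-the-envelope hallucination arithmetic.
# Rate from the Stanford study; query volumes are illustrative assumptions.

def expected_hallucinations(queries: int, rate: float = 0.17) -> float:
    # Expected number of hallucinated responses in a batch of queries.
    return queries * rate

def prob_at_least_one(queries: int, rate: float = 0.17) -> float:
    # Chance of at least one hallucinated response, treating each query
    # as an independent draw (a simplification).
    return 1 - (1 - rate) ** queries

for n in (5, 25, 100):
    print(f"{n:>3} queries: ~{expected_hallucinations(n):.1f} expected, "
          f"{prob_at_least_one(n):.0%} chance of at least one")

# Prints:
#   5 queries: ~0.9 expected, 61% chance of at least one
#  25 queries: ~4.2 expected, 99% chance of at least one
# 100 queries: ~17.0 expected, 100% chance of at least one

Note the last line: at 100 queries a month, at least one hallucination is a near-certainty. That is the point of the firm-wide multiplication above.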

The Courts Are Watching

While Stanford was publishing research, lawyers were getting sanctioned.

The Mata v. Avianca case became the cautionary tale: attorneys used ChatGPT for research, filed a brief with fabricated citations, and faced sanctions after opposing counsel (and the judge) noticed the non-existent cases.

But that was just the beginning.

According to Damien Charlotin's hallucination tracker, the pace has accelerated dramatically: "Before this spring in 2025, we maybe had two cases per week. Now we're at two cases per day or three cases per day."

More than 600 cases nationwide have involved lawyers citing non-existent authority due to AI use. Courts are responding with increasingly severe sanctions:

  • $10,000 fine in California for an appeal with 21 of 23 fake quotations
  • $3,000 per attorney in the MyPillow litigation for AI-generated fabrications
  • 90-day suspension for a Colorado attorney who failed to verify AI output
  • Removal as counsel in some cases
  • Mandatory reporting to bar grievance committees

The era of "I didn't know the AI made it up" as an excuse is over.

A New Wrinkle: Duty to Detect Opponent's Errors?

In a recent California case, the court declined to award attorneys' fees to opposing counsel partly because they failed to detect—or report—the fake citations in their opponent's brief.

This may be the first decision suggesting lawyers have some obligation to identify AI hallucinations in opposing filings, not just their own.

The implication: verification isn't just about protecting yourself. It may become a professional obligation.

The Vendor Response Problem

When the Stanford study first dropped, Thomson Reuters criticized the methodology because researchers had tested Ask Practical Law AI rather than Westlaw AI-Assisted Research (Thomson Reuters had denied access requests to the latter product).

After the study gained attention, Thomson Reuters granted access. The researchers re-ran their analysis.

The result? Westlaw's AI hallucinated at double the rate of Lexis.

This tells us something important about vendor claims: independent verification matters. The companies selling these tools have financial incentives to minimize reliability concerns. Stanford has no such incentive.

Trust the independent research.

What the Numbers Don't Tell You

The 17% and 34% figures are averages across query types. Your actual experience may vary based on:

  • Query complexity: Simple lookups likely perform better than nuanced legal analysis
  • Jurisdiction: Some jurisdictions may have better training data coverage
  • Practice area: Novel or specialized areas may see higher error rates
  • Recency: Recent developments may not be well-represented

The study provides a baseline, not a guarantee. Your specific use case could be better—or worse.

Practical Protection: The Verification Protocol

Given these realities, here's how to use AI research tools responsibly:

Before You Start

  • Understand this is assistance, not replacement
  • Plan time for verification (budget 20-30% of "saved" time)
  • Never use AI for final drafts without human review

For Every AI Research Response

  • Verify existence: Does every cited case actually exist?
  • Verify citation: Is the citation format correct? (Volume, reporter, page)
  • Verify holding: Does the case actually say what the AI claims?
  • Verify currency: Has the case been overruled, distinguished, or limited?
  • Verify relevance: Is this actually the controlling jurisdiction?
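
To make the order of operations concrete, here is a minimal Python sketch of this five-check gate. The structure is the point: a failed existence check ends the analysis immediately, because the remaining checks are meaningless for a case that does not exist. All names are hypothetical, not part of any vendor's API.

from dataclasses import dataclass

@dataclass
class CitationCheck:
    # One AI-supplied citation and the five verification results.
    # Field names are hypothetical; the checks mirror the list above.
    citation: str
    exists: bool              # found in a primary Westlaw/Lexis database?
    citation_correct: bool    # volume, reporter, and page all match?
    holding_supported: bool   # does the opinion say what the AI claims?
    still_good_law: bool      # no overruling or negative treatment?
    right_jurisdiction: bool  # controlling (or usefully persuasive) here?

def verify(check: CitationCheck) -> str:
    # Existence is the hard gate: a non-existent case is a hallucination.
    if not check.exists:
        return f"HALLUCINATION: {check.citation}. Stop; do not cite."
    failures = [name for name, ok in [
        ("citation format", check.citation_correct),
        ("holding", check.holding_supported),
        ("currency", check.still_good_law),
        ("jurisdiction", check.right_jurisdiction),
    ] if not ok]
    if failures:
        return f"FIX BEFORE FILING: {check.citation} failed {', '.join(failures)}."
    return f"VERIFIED: {check.citation}"

None of this automates the judgment calls; a human still reads the opinion. The code only enforces that no citation is marked verified with a check left open.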

Red Flags to Watch

  • Case names that sound too perfect for your argument
  • Citations you can't find in Westlaw/Lexis primary databases
  • Holdings that seem unusually broad or favorable
  • Cases from unexpected jurisdictions
  • Quotations that don't appear when you pull the actual opinion

Documentation

  • Screenshot AI outputs with timestamps
  • Note which portions were AI-assisted in your file
  • Document your verification steps
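
One lightweight way to keep that audit trail is an append-only, timestamped log, one record per AI-assisted task. A minimal sketch in Python, assuming a local JSON-lines file; the file name and fields are illustrative, not a prescribed format.

import json
from datetime import datetime, timezone

def log_ai_research(matter: str, tool: str, prompt: str,
                    verified_citations: list[str],
                    log_path: str = "ai_research_log.jsonl") -> None:
    # Append one timestamped record of an AI-assisted research task.
    # Fields are illustrative; adapt to your firm's retention practice.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "matter": matter,
        "tool": tool,
        "prompt": prompt,
        "verified_citations": verified_citations,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

Pair each record with the screenshots noted above and the completed checklist from the Workflow section, and the file becomes evidence of diligence if a citation is ever challenged.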

The Honest Assessment

Legal AI research tools represent genuine progress. A 17% hallucination rate is dramatically better than GPT-4's 69%. These tools can surface relevant authority faster than manual research.

But "better than terrible" isn't "good enough for professional use without verification."

The vendors want you to believe these tools can replace traditional research. They can't—not yet. What they can do is accelerate research that you then verify through traditional means.

Think of AI research tools as a first-pass filter, not a final answer. They help you find candidates for relevant authority. You still need to read the cases, verify the citations, and confirm the holdings yourself.

That's not a failure of the technology. It's an honest assessment of where we are in 2025.

Looking Ahead

The tools will improve. Hallucination rates will decrease. Eventually, we may reach reliability levels that justify reduced verification.

We're not there yet.

Until independent research—not vendor marketing—demonstrates substantially lower error rates, treat every AI research output as unverified until you've confirmed it yourself.

Your license depends on it.


Reliability Corner

Hallucination Rate Summary

Tool Category              Hallucination Rate  Verification Required
Legal RAG tools (Lexis)    ~17%                Always
Legal RAG tools (Westlaw)  ~34%                Always
General-purpose AI         60-75%              Never use for legal research

This Week's Sanctions Report

MyPillow litigation: Two attorneys fined $3,000 each for AI-generated filing with "more than two dozen mistakes."

California appeal: $10,000 sanction for brief where 21 of 23 case quotations were fabricated.

Running total (2024-2025): 600+ documented cases of AI hallucination in legal filings nationwide.


Workflow of the Month: The Citation Verification Checklist

Print this. Use it for every AI-assisted research task.

CITATION VERIFICATION PROTOCOL

Case: _________________________________
AI Tool Used: __________________________
Date: _________________________________

□ Step 1: EXISTENCE CHECK
  - Searched in [Westlaw/Lexis] primary database
  - Case exists: YES / NO
  - If NO → STOP. Flag as hallucination.

□ Step 2: CITATION ACCURACY
  - Volume number correct: YES / NO
  - Reporter correct: YES / NO
  - Page number correct: YES / NO
  - Year correct: YES / NO

□ Step 3: HOLDING VERIFICATION
  - Located relevant passage in opinion: YES / NO
  - AI's characterization accurate: YES / NO
  - Quote appears verbatim (if quoted): YES / NO

□ Step 4: CURRENCY CHECK
  - Shepardized/KeyCited: YES / NO
  - Still good law: YES / NO
  - Any negative treatment: _______________

□ Step 5: RELEVANCE CONFIRMATION
  - Correct jurisdiction: YES / NO
  - Binding or persuasive: _______________
  - Applicable to our facts: YES / NO

VERIFIED BY: _____________ DATE: _________

Time investment: 5-10 minutes per citation

ROI: Your license, your reputation, your client's case


Quick Hits

Research Updates:

  • Stanford study now published in Journal of Empirical Legal Studies
  • Methodology available for replication at other institutions

Vendor News:

  • Both Lexis and Westlaw continuing to claim improved accuracy
  • No independent verification of vendor accuracy claims yet

Regulatory Watch:

  • State bars increasingly addressing AI use in ethics opinions
  • ABA formal guidance expected later this year

Coming Next Issue:

  • Contract Review AI Showdown: Which tools actually save time?

The Bottom Line

A 17% hallucination rate means the best tools fabricate information roughly once every six queries.

A 34% rate means one in three.

These numbers come from independent academic research, not marketing materials.

Every vendor will tell you their tool is reliable. Stanford's researchers actually tested them.

Build verification into every workflow. Document your process. Never file anything AI-generated without independent confirmation.

The technology is useful. The technology is not trustworthy. Both statements are true.


TwinLadder Weekly | Issue #2 | February 2025

Helping lawyers build AI capability through honest education.
