TwinLadder

Issue #2

Stanford's Hallucination Study: What 17% Error Rate Really Means for Your Practice

Stanford RegLab's benchmark tested the major legal AI research tools, manually verifying every citation, and found hallucination rates between roughly 17% and 34% for the dedicated legal products. We break down which tasks carry real risk and which verification protocols actually work.

AI Hallucination · Stanford Research · Verification · Risk Assessment

February 28, 2025 · 12 min read

TwinLadder Weekly

Issue #2 | February 2025


Editor's Note

Last week I sat through a vendor demo in Brussels where the sales engineer said, with complete confidence, "our tool is essentially hallucination-free." I asked him to define "essentially." He could not.

This is the state of legal AI marketing in 2025. Vendors make claims about reliability that independent researchers have directly contradicted. And because most practitioners do not read academic papers, the marketing wins. I have watched this pattern across twenty years of legal technology — the vendor promise always runs ahead of the verified reality. What makes this iteration different is that the gap between promise and reality can end careers.

Stanford's RegLab decided to test the vendors' claims with actual methodology — preregistered, peer-reviewed, published in the Journal of Empirical Legal Studies. What they found should change how every one of us thinks about AI-assisted research. Not because the tools are useless — they genuinely are not. But because the gap between what vendors promise and what researchers measure is wide enough to cost licences, invite sanctions, and produce malpractice claims. In Europe, where the EU AI Act now requires documented AI literacy under Article 4, understanding these reliability numbers is not academic curiosity. It is a compliance obligation.


What a 17% Hallucination Rate Actually Means

Stanford RegLab's "Hallucination-Free?" study was straightforward in design and devastating in results. Create realistic legal research queries, run them through the leading AI tools, manually verify every response and citation. That last part — manual verification — was, the researchers noted, "extraordinarily time consuming." Each response contained multiple citations, each requiring independent confirmation.

Tool | Hallucination Rate | Fully Accurate Responses
Lexis+ AI | ~17% | 65%
Westlaw AI-Assisted Research | ~34% | Not disclosed
Ask Practical Law AI | ~17% | 18%
General-purpose GPT-4 (no legal RAG) | ~69% | Not disclosed

Let me translate that into practice. If you run five research queries per day using the best available tool, you will encounter roughly one hallucinated response daily. Twenty-five queries per week means four to five potential errors. Scale that to a 20-lawyer firm doing moderate AI-assisted research and you are generating dozens of fabrications weekly — any one of which could become a sanctions motion, a malpractice claim, or a bar complaint.
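The back-of-envelope arithmetic above can be made explicit. A minimal sketch, assuming each query fails independently at the tool's measured rate (the 17% figure is the best-in-class rate from the table; the independence assumption is ours, for illustration only):

```python
# Expected hallucinations and the chance of at least one, assuming each
# query fails independently at the tool's measured hallucination rate.
# (Independence is a simplifying assumption, not a claim from the study.)

def expected_errors(queries: int, rate: float) -> float:
    """Expected number of hallucinated responses."""
    return queries * rate

def p_at_least_one(queries: int, rate: float) -> float:
    """Probability that at least one response is hallucinated."""
    return 1 - (1 - rate) ** queries

rate = 0.17  # best-in-class rate reported by the Stanford study

print(round(expected_errors(5, rate), 2))    # 0.85 -> roughly one per day at 5 queries/day
print(round(expected_errors(25, rate), 2))   # 4.25 -> four to five per week at 25 queries/week
print(round(p_at_least_one(20, rate), 3))    # 0.976 -> near-certainty of at least one in 20 queries
```

The last number is the one worth internalising: at a 17% failure rate, twenty unverified queries make at least one hallucination all but guaranteed.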

The researchers identified two distinct failure modes, and the distinction matters. Incorrect responses, where the AI misstates the law or the facts: it tells you a statute says something it does not, or mischaracterises a holding. And misgrounded responses, where the AI states the law correctly but cites sources that do not support the claim. The second type is more dangerous precisely because it sounds right. The analysis is plausible. The citation looks legitimate. But when you pull the case, it does not say what the AI claimed.

When Thomson Reuters initially criticised the methodology — because researchers had tested Ask Practical Law AI rather than Westlaw AI-Assisted Research, after Thomson Reuters denied access requests — Stanford re-ran the analysis once access was granted. The result: Westlaw hallucinated at double the rate of Lexis. The vendor's attempt to discredit the study produced worse numbers for its own product. There is a lesson in that about the value of independent verification.

This tells us something European practitioners should pay close attention to. These are the best-funded, most mature legal AI products in the world, built on proprietary legal databases by companies with decades of legal publishing experience. And the best one still hallucinates one time in six. The tools coming to European markets — adapted for civil law jurisdictions, multilingual, often built on smaller training corpora — will not perform better. They may perform worse.

For those of us practising across European jurisdictions, the multi-jurisdictional dimension compounds the problem. An AI tool trained primarily on US and UK common law jurisprudence will produce different reliability rates when asked about German commercial code, French administrative law, or Latvian civil procedure. Stanford tested English-language, common-law queries. Nobody has yet published equivalent research for European civil law systems. Until someone does, treat the Stanford numbers as a floor, not a ceiling.

Your actual experience will vary based on query complexity, jurisdiction coverage, practice area specialisation, and recency of developments. The study provides a baseline, not a guarantee. Your specific use case could be better. Or worse.


The Competence Question

The courts are no longer patient with lawyers who do not understand their tools. According to Damien Charlotin's hallucination tracker, the pace of AI-related sanctions has accelerated dramatically: "Before this spring in 2025, we maybe had two cases per week. Now we're at two cases per day or three cases per day." More than 600 cases in the US alone have involved lawyers citing non-existent authority.

Sanctions Trajectory | What It Means
2023 to early 2025: ~2 cases per week | Courts issuing warnings
Mid-2025 onward: 2-3 cases per day | Courts losing patience
600+ total cases of fabricated citations | Pattern too large to ignore
$10,000 fine (California lawyer; 21 of 23 citations fabricated) | Financial consequences escalating
90-day suspension (Colorado attorney) | Professional consequences arriving

A California lawyer was fined $10,000 for an appeal where 21 of 23 case quotations were fabricated. Attorneys in the MyPillow litigation were fined $3,000 each for "more than two dozen mistakes." A Colorado attorney received a 90-day suspension for failing to verify AI output. Courts have removed counsel and ordered mandatory reporting to bar grievance committees.

And here is a development that should concern everyone: a recent California decision declined to award attorneys' fees partly because opposing counsel failed to detect — or report — fake citations in the other side's brief. We may be watching the emergence of a duty to identify AI hallucinations in opposing filings, not just your own.

For European practitioners, the disciplinary landscape is developing differently but not more leniently. In Australia, a solicitor was prohibited from unsupervised practice for two years. In Canada, Ko v. Li imposed contempt of court sanctions. European bar associations are watching these precedents closely. The Latvian Bar, the Netherlands Bar, and the German Federal Bar have all issued, or begun drafting, guidance on AI use. The EU AI Act's Article 4 literacy requirement adds a regulatory dimension that does not exist in the US: if your staff cannot demonstrate "sufficient AI literacy," you face regulatory exposure before a single hallucination reaches a court.

The competence question for 2025 is not whether you use AI. It is whether you understand it well enough to catch its failures — and whether you can prove that understanding to a regulator or a court.


What To Do

  1. Budget 20-30% of your "saved" time for verification. AI-assisted research is faster even with verification built in. But the time savings disappear entirely if you file a brief citing non-existent cases. Plan for it. Protect against it. The time you invest in verification is insurance against career-ending mistakes.

  2. Verify five things for every AI-generated citation. Existence — does the case exist in primary databases? Accuracy — is the citation format correct? Holding — does the case say what the AI claims? Currency — has it been overruled or distinguished? Relevance — correct jurisdiction, binding or persuasive? For European practitioners, add a sixth: language — was the original decision in the language the AI is presenting, or has the AI translated and potentially altered the holding?

  3. Watch for red flags. Citations that sound too perfect for your argument. Holdings that seem unusually broad or favourable. Cases from unexpected jurisdictions. Quotations that do not appear verbatim in the actual opinion. In my experience, the more perfectly an AI citation supports your position, the more carefully you should verify it.

  4. Document your verification process. Screenshot AI outputs with timestamps. Note which portions were AI-assisted in your file. Record your verification steps. If you are ever questioned — by a court, a bar association, or an insurer — your documented process is your defence. Under Article 4, this documentation may also serve as evidence of AI literacy compliance.

  5. Treat AI research as a first-pass filter, not a final answer. It helps you find candidates for relevant authority faster than manual research. You still need to read the cases, verify the citations, and confirm the holdings yourself. That is not a failure of the technology. It is an honest assessment of where we are. The lawyers who understand this distinction will thrive. The lawyers who trust the output will eventually face consequences.
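For teams that want to track verification systematically, the six checks in step 2 can be captured as a simple record per citation. A hypothetical sketch: the `CitationCheck` class and its field names are ours, for illustration, not part of any real product or bar-mandated form.

```python
from dataclasses import dataclass, fields

@dataclass
class CitationCheck:
    """One AI-generated citation and the six verification checks from step 2.

    Every name here is illustrative; adapt the fields to your own filing system.
    """
    citation: str
    exists: bool = False          # found in a primary database?
    accurate: bool = False        # citation format correct?
    holding_matches: bool = False # case says what the AI claims?
    current: bool = False         # not overruled or distinguished?
    relevant: bool = False        # right jurisdiction, binding or persuasive?
    same_language: bool = False   # original language, not an AI translation?

    def passes(self) -> bool:
        """A citation is usable only if every check passes."""
        return all(getattr(self, f.name) for f in fields(self)
                   if f.name != "citation")

# Two checks done, four outstanding: the citation does not yet pass.
check = CitationCheck("Case X v. Y", exists=True, accurate=True)
print(check.passes())  # False
```

Kept in the file with timestamps, a record like this doubles as the documentation step 4 calls for: evidence of a verification process you can show a court, a bar association, or an insurer.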


Quick Reads

  • Stanford's study is now published in the Journal of Empirical Legal Studies, with methodology available for replication at other institutions. This is peer-reviewed science, not a blog post. European legal faculties should be replicating this methodology for civil law jurisdictions.

  • State bars are accelerating AI ethics guidance — roughly half of US states have now issued formal opinions addressing AI use. In Europe, the EU AI Act provides a regulatory framework, but national bar associations are still catching up. The gap is an opportunity for proactive firms.

  • Both Lexis and Westlaw continue to claim improved accuracy since the study period, but no independent verification of those claims exists yet. Until independent researchers confirm improvement, treat vendor accuracy claims as marketing. This applies equally to European legal AI products.

  • Stanford HAI analysis provides an accessible summary of the findings for those who want the key insights without reading the full paper. Share it with your team — it is the single most important piece of research for any lawyer using AI tools.


One Question

If the best legal AI tool fabricates information one in six times, and you use it twenty times this week, how confident are you that your verification process caught every error? And if you do not have a verification process — what exactly is your plan?


TwinLadder Weekly | Issue #2 | February 2025

Helping European professionals build AI competence through honest education.

Included Workflow

Citation Verification Protocol

5-step protocol for verifying AI-generated citations: existence check, citation accuracy, holding verification, currency check, and relevance confirmation.
