TwinLadder Weekly

Issue #8 | May 2025


Harvey Goes Multi-Model: What Anthropic + Google Integration Means

Harvey drops its single-model approach for intelligent orchestration. Here's why your legal AI workflow just got more complicated—and potentially more capable.


Last issue, we covered the SRA's approval of Garfield.Law as the first AI-only law firm. This issue, we analyze Harvey's strategic pivot from single-model dependency to multi-model orchestration—and what it signals for the legal AI market.

The Strategic Shift

On May 13, 2025, Harvey announced integration of foundation models from Google and Anthropic, transforming from a single-model consumer to an intelligent multi-model orchestrator.

This is noteworthy because Harvey is one of OpenAI Startup Fund's most successful early-backed portfolio companies. The decision to integrate competing models signals that model selection is becoming a strategic capability, not a vendor relationship.

How Multi-Model Routing Works

Harvey's platform now routes legal tasks to the most suitable model based on task type:

Task Type               Optimal Model                 Why
Legal drafting          Gemini 2.5 Pro                Superior performance on extended document generation
Complex reasoning       Claude 3.7 Sonnet / o1        Better handling of evidentiary analysis
Large document review   Gemini 2.5 Pro                1M+ token context window
Research queries        Model with superior recall    Task-dependent selection
Jurisdiction-specific   Regional training strength    Varies by geography

The key insight: like lawyers, modern models have different strengths, weaknesses, and biases.
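To make that concrete, here's a minimal sketch of what task-based routing might look like. Harvey hasn't published its routing logic, so the function names, model identifiers, and thresholds below are illustrative assumptions, not its actual implementation:

from dataclasses import dataclass

# Illustrative routing table; entries are drawn from the task table above,
# not from Harvey's actual (unpublished) logic.
ROUTING_TABLE = {
    "legal_drafting":    "gemini-2.5-pro",     # extended document generation
    "complex_reasoning": "claude-3-7-sonnet",  # evidentiary analysis
    "large_review":      "gemini-2.5-pro",     # 1M+ token context window
}

@dataclass
class LegalTask:
    task_type: str
    prompt: str
    estimated_tokens: int

def route_task(task: LegalTask) -> str:
    """Pick a model for a task; input size overrides the type-based default."""
    # Context size trumps task type: only a long-context model can hold
    # an entire data room in a single window.
    if task.estimated_tokens > 200_000:
        return "gemini-2.5-pro"
    return ROUTING_TABLE.get(task.task_type, "gpt-4")  # conservative default

print(route_task(LegalTask("complex_reasoning", "Analyze hearsay exceptions...", 8_000)))
# -> claude-3-7-sonnet

The architectural point: the routing table, not any single model, encodes the platform's judgment about which model suits which task.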

The BigLaw Bench Evidence

Harvey's decision isn't arbitrary. Their BigLaw Bench testing revealed model-specific performance variations:

Gemini 2.5 Pro:

  • Excels at legal drafting tasks
  • Struggles with trial preparation and oral argument
  • Difficulties reasoning about complex evidentiary rules like hearsay

OpenAI o1 and Claude 3.7 Sonnet:

  • Stronger in complex reasoning scenarios
  • Better handling of evidentiary analysis
  • Superior performance on procedural considerations

Context Window Advantage: Gemini 2.5 Pro's 1 million token context window (expandable to 2 million) provides distinct advantages for processing extensive legal documentation—entire transaction rooms rather than individual documents.
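Whether a document set actually fits one window is something firms can estimate themselves. A rough sketch, assuming the common approximation of about 4 characters per token for English prose (real tokenizers vary by model, so use the provider's token counter for anything that matters):

from pathlib import Path

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English legal prose.
    # Real tokenizers vary by model; use the provider's counter in production.
    return len(text) // 4

def fits_in_window(doc_dir: str, window: int = 1_000_000) -> bool:
    """Check whether an entire document set fits one 1M-token context window."""
    total = sum(estimate_tokens(p.read_text(errors="ignore"))
                for p in Path(doc_dir).glob("**/*.txt"))
    return total <= window

If the corpus fits, routing to a long-context model avoids the manual document splitting described later in this issue.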

Why This Matters for Practitioners

1. Single-Vendor Risk Reduction

Relying on one model provider creates operational risk. When OpenAI experiences outages or rate limits, single-model platforms go dark. Multi-model architecture provides fallback capability.
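A fallback chain is simple in principle. Here's a minimal sketch; the provider callables are placeholders for whatever SDK wrappers a platform actually uses:

import time
from typing import Callable, List

def call_with_fallback(prompt: str,
                       providers: List[Callable[[str], str]],
                       retries: int = 2) -> str:
    """Try each provider in order, moving on after repeated failures.

    providers: callables, each wrapping one vendor's SDK call.
    """
    last_error = None
    for call_model in providers:
        for attempt in range(retries):
            try:
                return call_model(prompt)
            except Exception as exc:      # outage, rate limit, timeout...
                last_error = exc
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"All providers failed; last error: {last_error}")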

2. Task-Optimized Output Quality

Different legal tasks benefit from different model architectures. A memo requiring extended reasoning differs from a document review requiring massive context. Intelligent routing matches task to capability.

3. Competitive Pricing Pressure

With multiple viable providers, Harvey can negotiate better terms. This eventually flows to pricing pressure across the legal AI market.

4. Security Architecture Evolution

Both models are integrated through their respective cloud providers (AWS Bedrock, Google Vertex), with the same security and privacy guarantees. This signals growing enterprise acceptance of alternative providers beyond Microsoft Azure.

The Complexity Trade-Off

Multi-model isn't a free lunch. New challenges include:

Consistency: Different models produce different outputs. The same prompt may yield varying results depending on routing. This creates predictability challenges for workflows expecting uniform behavior.

Testing Burden: Firms must now validate outputs across multiple model backends. Your prompt engineering may work perfectly on GPT-4 but fail on Claude or Gemini.

Audit Complexity: Which model produced which output? For compliance and malpractice purposes, tracking model provenance adds operational overhead (a logging sketch follows this list).

Vendor Management: Instead of one relationship, Harvey now manages three. That complexity eventually surfaces somewhere in the product.
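On the audit challenge above, the fix is mechanical even if the overhead is real: log which model touched which task at call time. A minimal sketch, assuming a JSONL audit log with hashed content; the record fields are our suggestion, not any vendor's format:

import hashlib
import json
from datetime import datetime, timezone

def log_model_call(task_id: str, model: str, provider: str,
                   prompt: str, output: str,
                   path: str = "audit_log.jsonl") -> None:
    """Append a provenance record so every output traces back to a model.

    Hashes keep the log client-shareable without exposing privileged text;
    store the full content separately under access control.
    """
    record = {
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "provider": provider,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")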

What This Signals for the Market

Harvey's move suggests several market dynamics:

1. Model commoditization is accelerating. If the leading legal AI platform treats models as interchangeable components, others will follow.

2. The integration layer becomes the moat. Harvey's value increasingly lies in task routing intelligence, not model access.

3. Enterprise security barriers are falling. Google and Anthropic have successfully addressed the concerns that previously limited enterprise legal AI to Azure/OpenAI only.

4. Specialization is the future. General-purpose models are giving way to task-specific selection.


Tool Review: Multi-Model Legal AI Platforms

Comparing approaches to model orchestration in legal AI

Harvey (Multi-Model)

Models: OpenAI GPT-4, Anthropic Claude, Google Gemini
Selection: Automatic task-based routing
Enterprise Status: 500+ customers, 50+ AmLaw 100 firms

Strengths:

  • Intelligent routing based on task type
  • Enterprise security across all providers
  • Fallback capability if one provider fails
  • Context window flexibility (Gemini's 1M+ tokens)

Limitations:

  • Output consistency varies by model
  • More complex audit trail
  • Premium pricing reflects infrastructure complexity

Best For: Large firms requiring maximum capability across diverse legal tasks
Rating: 4.5/5 for enterprise deployments


CoCounsel (Thomson Reuters)

Models: Primarily GPT-4 based
Selection: Single-model architecture
Enterprise Status: Integrated with Westlaw, widely deployed

Strengths:

  • Consistent output behavior
  • Deep Westlaw integration
  • Established vendor relationship
  • Clear audit trail

Limitations:

  • Single-vendor dependency
  • Context window constraints
  • Less flexibility on task optimization

Best For: Firms prioritizing stability and Westlaw integration
Rating: 4/5 for research-focused workflows


Lexis+ AI (LexisNexis)

Models: Multiple providers, details undisclosed
Selection: Task-specific implementation
Enterprise Status: Integrated with Lexis research platform

Strengths:

  • Native integration with LexisNexis content
  • Hallucination mitigation through citation verification
  • Familiar interface for Lexis users

Limitations:

  • Less transparency on model selection
  • Tied to LexisNexis ecosystem
  • Emerging capability vs. established competitors

Best For: Firms already invested in LexisNexis ecosystem
Rating: 3.5/5, improving rapidly


The Honest Assessment

Multi-model isn't automatically better. For firms with narrow, predictable workflows, single-model simplicity may outweigh routing benefits. For diverse practices handling everything from brief writing to due diligence, task-optimized routing delivers measurable improvement.

The question isn't "which model is best?" It's "which model is best for this specific task?"


What's Working: Multi-Model Success Stories

Success Story #1: The Due Diligence Transformation

Firm type: AmLaw 50, M&A practice
Challenge: 2,000+ document data room review for acquisition

Before multi-model: "We'd hit context limits constantly. Splitting documents manually, losing track of cross-references. Associates spent more time managing the AI than reviewing documents."

After Harvey's Gemini integration: "The 1M token context window changed everything. We loaded entire document sets, asked questions across the full corpus. What took a week compressed to two days."

Key insight: Context window constraints were the bottleneck. Task-specific model selection addressed the actual limitation.


Success Story #2: The Brief That Needed Reasoning

Firm type: Mid-size litigation boutique
Challenge: Complex evidentiary argument for motion in limine

Before multi-model: "GPT-4 kept producing surface-level analysis. It understood hearsay exceptions in isolation but couldn't reason through the interaction between 803(6), 807, and the confrontation clause implications."

After Claude routing: "Harvey routed the task to Claude 3.7 Sonnet. The reasoning depth improved dramatically. It worked through the exception stacking and identified potential confrontation clause issues we hadn't considered."

Key insight: Extended reasoning tasks benefit from models optimized for that capability. Not every LLM reasons the same way.


Hard Cases: Where Multi-Model Struggles

Hard Case #1: The Inconsistent Output Problem

Scenario: Partner reviews associate's AI-assisted memo. Three weeks later, same prompt produces different analysis.

Problem: A different model was routed for the same task. The first output came from Claude; the second from GPT-4. Substantively similar but stylistically different, with slightly different emphasis.

User frustration: "I can't build muscle memory for what the tool produces. Every time feels like working with a different associate."

Lesson: Consistency has value. Multi-model routing optimizes capability but sacrifices predictability.
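One mitigation worth asking vendors about: per-matter model pinning, which trades routing optimality for predictable output within a single engagement. A hypothetical sketch; we don't know whether Harvey exposes anything like this:

from typing import Callable, Dict

def pinned_route(matter_id: str, task_type: str,
                 pins: Dict[str, str],
                 auto_router: Callable[[str], str]) -> str:
    """Honor a per-matter pin before falling back to automatic routing."""
    # A pin sacrifices task-optimized selection for output that a
    # reviewing partner can build expectations around.
    return pins.get(matter_id) or auto_router(task_type)

pins = {"matter-2024-117": "claude-3-7-sonnet"}  # partner wants a stable voice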


Hard Case #2: The Audit Trail Challenge

Scenario: Client questions bill for "AI-assisted research" at 2 hours. Wants to know what the AI actually did.

Problem: Harvey processed the request across two models—initial research on one, synthesis on another. The audit log shows model switches but doesn't clearly explain why.

Client concern: "You charged me for two hours of AI work but can't tell me which AI did what? How do I know this was efficient?"

Lesson: Multi-model creates explainability challenges. Clients asking about AI usage deserve clear answers.


Hard Case #3: The Prompt Engineering Portability Problem

Scenario: Firm invested heavily in prompt libraries optimized for GPT-4.

Problem: Those prompts don't transfer perfectly to Claude or Gemini. Subtle differences in how models interpret instructions mean reworking the entire library.

Associate report: "Our carefully crafted prompts for contract review assumed GPT-4 behavior. Claude interprets some instructions differently. We're basically starting over."

Lesson: Prompt engineering isn't model-agnostic. Multi-model capability may require multi-model prompt development.
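One way to manage that rework is to split prompts into a shared base plus model-specific overrides, so a routing change doesn't invalidate the whole library. A sketch of the structure, with illustrative prompt text:

# Prompt library keyed by (task, model): shared instructions live in one
# place, model-specific phrasing lives in overrides.
BASE_PROMPTS = {
    "contract_review": "Review the contract below. Flag unusual indemnity, "
                       "limitation-of-liability, and termination clauses.",
}

MODEL_OVERRIDES = {
    ("contract_review", "claude-3-7-sonnet"):
        "List findings as numbered items with clause citations.",
    ("contract_review", "gemini-2.5-pro"):
        "Respond in a table: clause | risk | rationale.",
}

def build_prompt(task: str, model: str, document: str) -> str:
    """Compose the shared base prompt with any model-specific override."""
    parts = [BASE_PROMPTS[task]]
    override = MODEL_OVERRIDES.get((task, model))
    if override:
        parts.append(override)
    parts.append(document)
    return "\n\n".join(parts)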


Reliability Corner

Harvey's Growth Metrics (May 2025)

Metric                  Value            Source
Weekly Active Users     4x YoY growth    Harvey blog
Enterprise Customers    500+             Harvey announcement
AmLaw 100 Coverage      50+ firms        TechCrunch
Countries               53               Harvey blog
ARR (estimated)         $75M+            Sacra estimates

Model Capability Comparison (BigLaw Bench)

Task Category        GPT-4     Claude 3.7   Gemini 2.5 Pro
Legal Drafting       Good      Good         Excellent
Complex Reasoning    Good      Excellent    Moderate
Evidence Analysis    Good      Excellent    Struggles
Large Context        Limited   Good         Excellent
Oral Argument Prep   Good      Excellent    Struggles

This Month's Perspective

The multi-model announcement isn't just about Harvey. It's a market signal that model selection is becoming a core capability for legal AI platforms. Firms evaluating AI tools should ask: "What models does this use, and how does it decide?"


Workflow of the Month: Multi-Model AI Evaluation Checklist

When evaluating legal AI tools that use multiple models, assess these factors:

MULTI-MODEL AI EVALUATION
==========================

TOOL: _____________________________
DATE: _____________________________
EVALUATOR: ________________________

MODEL TRANSPARENCY
[ ] Which models does the tool use?
    Models: _________________________
[ ] Is model selection disclosed per task?
    YES / NO / PARTIAL
[ ] Can users override automatic routing?
    YES / NO

CONSISTENCY ASSESSMENT
[ ] Same prompt, same output?
    Test 3x with identical input
    Result 1: _______________________
    Result 2: _______________________
    Result 3: _______________________
    Consistency rating: HIGH / MEDIUM / LOW

[ ] Do outputs vary by time of day?
    (Different load = different routing)
    YES / NO / UNTESTED

AUDIT TRAIL QUALITY
[ ] Does the tool log which model processed each task?
    YES / NO
[ ] Is the audit trail client-shareable?
    YES / NO / REDACTED VERSION
[ ] Can you explain model selection to a client?
    YES / PARTIALLY / NO

PROMPT PORTABILITY
[ ] Do your existing prompts work consistently?
    Test 5 standard prompts across models
    Working: ___/5
[ ] Does the vendor provide model-specific guidance?
    YES / NO

SECURITY VERIFICATION
[ ] Which cloud providers host each model?
    Provider 1: _____________________
    Provider 2: _____________________
    Provider 3: _____________________
[ ] Same security guarantees across all providers?
    YES / NO / VARIES
[ ] Data residency consistent across models?
    YES / NO

FALLBACK CAPABILITY
[ ] What happens if primary model is unavailable?
    _________________________________
[ ] Is there automatic failover?
    YES / NO
[ ] Does failover affect output quality?
    YES / NO / UNKNOWN

PRICING TRANSPARENCY
[ ] Does pricing vary by model used?
    YES / NO
[ ] Can you predict costs for specific tasks?
    YES / APPROXIMATELY / NO
[ ] Are expensive models charged at a premium?
    _________________________________

RECOMMENDATION
[ ] Suitable for our use case: YES / NO / CONDITIONAL
[ ] Primary concern: _________________
[ ] Alternative if unsuitable: ________

VERIFIED BY: _____________ DATE: _______

Time investment: 30-45 minutes per tool
Why it matters: Multi-model complexity requires explicit evaluation of consistency, auditability, and transparency.
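The consistency assessment above is easy to automate if the vendor exposes an API. A minimal sketch; run_tool stands in for whatever wrapper you have around the tool, and the HIGH/MEDIUM/LOW heuristic is ours, not a standard:

from typing import Callable

def consistency_check(run_tool: Callable[[str], str],
                      prompt: str, trials: int = 3) -> str:
    """Run one prompt several times and rate consistency HIGH/MEDIUM/LOW."""
    outputs = [run_tool(prompt) for _ in range(trials)]
    if len(set(outputs)) == 1:
        return "HIGH"  # byte-identical every time
    lengths = [len(o) for o in outputs]
    # Crude proxy: similar lengths suggest the same substance reworded.
    if max(lengths) - min(lengths) < 0.2 * max(lengths):
        return "MEDIUM"
    return "LOW"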


Quick Hits

Coming Next Issue:

  • Harvey Hits $5B Valuation: The 80x Revenue Multiple No One Questions

Ask the Community

Harvey's multi-model pivot raises questions we're tracking:

  1. For Harvey users: Have you noticed output differences since the multi-model integration? Better? Worse? Different?
  2. For firms evaluating AI: Is multi-model capability a requirement, nice-to-have, or irrelevant for your selection criteria?
  3. For IT/security teams: How does multi-model architecture affect your risk assessment?
  4. Prompt engineers: Are you maintaining model-specific prompt libraries? What's working?

Reply to share. Anonymized contributions welcome.


TwinLadder Weekly | Issue #8 | May 2025

Helping lawyers build AI capability through honest education.

