Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · a real 92-file production codebase · the same model (Claude Sonnet 4.6) and the same questions, run with and without GrapeRoot.

45% cheaper on complex tasks (v3.8.35 challenge benchmark · 10/10 prompts)
10/10 cost + quality wins (clean sweep across the challenge benchmark)
34% avg savings on the E2E benchmark (16/20 cost wins · real-world multi-step prompts)

E2E Per-Prompt Cost

Cost per prompt, E2E benchmark


E2E Quality (Dual)

Regex + LLM judge dual scoring


3-Way Cost (CGC)

GrapeRoot vs Normal vs CGC cost


Quality by Level

Quality breakdown by difficulty level


Win Matrix

Win/loss matrix across configurations


E2E Savings Waterfall

Per-prompt savings, E2E run


Win Rate Evolution

Win rate improvement over time


E2E Turns

Turn count comparison


GR vs CGC Savings

GrapeRoot vs CGC savings delta


E2E Wall Time

Total wall time per prompt


Total Spend

Aggregate spend across all runs
