Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · a real 92-file production codebase · the same model (Claude Sonnet 4.6) and the same questions, run with and without GrapeRoot.

45% cheaper on complex tasks (v3.8.35 challenge benchmark · 10/10 prompts)
10/10 cost + quality wins (clean sweep across the challenge benchmark)
34% average savings on the E2E benchmark (16/20 cost wins · real-world multi-step prompts)

Per-Task Savings (v3.8.35)

Cost reduction per task type — up to 81%

Quality Comparison (v3.8.35)

GrapeRoot vs Normal — quality held or improved on every prompt

Win Rate Evolution

How GrapeRoot improved across benchmark runs

E2E Savings Waterfall

Per-prompt savings across 20 real-world tasks

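For readers who want the arithmetic behind a waterfall like this: per-prompt savings are just the cost delta as a fraction of the baseline cost for the same prompt. A minimal Python sketch, using made-up cost pairs rather than the actual benchmark data:

```python
# Hypothetical (prompt, baseline_cost, graperoot_cost) pairs in USD.
# Illustrative values only, not the actual benchmark data.
runs = [
    ("refactor auth module", 0.42, 0.21),
    ("trace failing test", 0.18, 0.15),
    ("add API endpoint", 0.30, 0.19),
]

for prompt, baseline, treated in runs:
    # Per-prompt savings: cost delta as a fraction of the baseline.
    print(f"{prompt}: {(baseline - treated) / baseline:.0%} saved")

# Aggregate over totals so expensive prompts carry proportional weight,
# rather than averaging the per-prompt percentages.
total_base = sum(b for _, b, _ in runs)
total_treated = sum(t for _, _, t in runs)
print(f"overall: {(total_base - total_treated) / total_base:.0%} saved")
```
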
E2E Quality (Regex + LLM Judge)

Dual-scored quality — regex coverage + LLM judge

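As a sketch of what dual scoring can look like in practice: a deterministic regex pass checks that required facts show up in an answer, and an LLM judge supplies a holistic 0-1 rating; both are reported side by side. The pattern list, judge score, and equal weighting below are illustrative assumptions, not GrapeRoot's actual harness:

```python
import re

def regex_coverage(answer: str, patterns: list[str]) -> float:
    """Fraction of required regex patterns that match the answer."""
    hits = sum(bool(re.search(p, answer, re.IGNORECASE)) for p in patterns)
    return hits / len(patterns)

def dual_score(answer: str, patterns: list[str], judge_score: float) -> dict:
    """Report deterministic coverage next to a 0-1 LLM judge rating."""
    coverage = regex_coverage(answer, patterns)
    return {
        "regex_coverage": coverage,
        "llm_judge": judge_score,
        # Equal weighting is an assumption; any blend could be used.
        "combined": 0.5 * coverage + 0.5 * judge_score,
    }

# Hypothetical patterns a correct answer about a config loader might hit;
# the judge score would come from a separate LLM call in a real harness.
patterns = [r"load_config", r"\bYAML\b", r"environment variable"]
answer = "load_config reads YAML first, then environment variables."
print(dual_score(answer, patterns, judge_score=0.9))
```
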
Win Rate Grid

Win/loss/tie breakdown across all versions

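A grid like this reduces to a tally: for each version, compare every prompt's outcome against the baseline and count wins, losses, and ties. A small sketch over hypothetical outcome lists (version labels and results are invented for illustration):

```python
from collections import Counter

# Hypothetical per-prompt outcomes versus the no-GrapeRoot baseline.
results = {
    "v3.8.0":  ["win", "loss", "tie", "win", "loss"],
    "v3.8.35": ["win", "win", "win", "tie", "win"],
}

for version, outcomes in results.items():
    tally = Counter(outcomes)
    print(f"{version}: {tally['win']}W / {tally['loss']}L / {tally['tie']}T "
          f"({tally['win'] / len(outcomes):.0%} win rate)")
```
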
Efficiency Radar

Multi-dimensional efficiency comparison

Cost vs Quality

Every prompt plotted: lower cost, higher quality

Cost Evolution (All Versions)

How each architecture iteration reduced cost

Bias Analysis

Expected vs actual winners — fair benchmark

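One way to check a benchmark for tilt, sketched here with invented data: record which side you expect to win each prompt before running it, then compare those expectations against the actual winners and look for surprises piling up on one side:

```python
# Hypothetical pre-registered expectations vs. observed winners per prompt.
expected = ["graperoot", "normal", "graperoot", "tie", "graperoot"]
actual   = ["graperoot", "graperoot", "graperoot", "tie", "normal"]

matches = sum(e == a for e, a in zip(expected, actual))
print(f"expected/actual agreement: {matches}/{len(expected)}")

# Upsets split by direction: a fair prompt set should not see surprises
# land overwhelmingly on one side.
for_tool = sum(e != a and a == "graperoot" for e, a in zip(expected, actual))
against = sum(e != a and a == "normal" for e, a in zip(expected, actual))
print(f"upsets favoring GrapeRoot: {for_tool} · favoring baseline: {against}")
```
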
Turns & Wall Time (E2E)

Fewer turns, faster responses across 20 prompts
