Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across the challenge benchmark
34% avg savings on the E2E benchmark · 16/20 cost wins · real-world multi-step prompts
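
To read the headline numbers: savings here is the relative cost reduction on the same prompt run with GrapeRoot versus without, averaged across the run. A minimal sketch of that arithmetic, with placeholder costs and field names rather than the benchmark's actual data:

    # Paired per-prompt costs in USD, same prompt run with and without
    # GrapeRoot. Values and field names here are illustrative only.
    runs = [
        {"baseline_usd": 0.42, "graperoot_usd": 0.27},
        {"baseline_usd": 0.95, "graperoot_usd": 0.51},
    ]

    def savings(baseline: float, treated: float) -> float:
        # Relative cost reduction: 0.34 means 34% cheaper.
        return (baseline - treated) / baseline

    per_prompt = [savings(r["baseline_usd"], r["graperoot_usd"]) for r in runs]
    print(f"avg savings: {sum(per_prompt) / len(per_prompt):.0%}")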

Per-Prompt Cost: cost per prompt comparison

Quality (Regex + LLM): dual-scored quality

Turns & Wall Time: fewer turns, faster responses

Savings Waterfall: per-prompt cost savings

Quality Radar: multi-axis quality comparison

Cost vs Quality: every prompt plotted as a cost-vs-quality scatter

Classifier Decisions: how the classifier routed prompts

Bias Analysis: expected vs actual winners

Scorer Agreement: regex vs LLM judge agreement rate
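
How to read the agreement rate: the share of prompts where the regex scorer and the LLM judge reach the same verdict. A minimal sketch of that computation, assuming each scorer emits a per-prompt winner (the verdict labels are hypothetical):

    # One entry per prompt: which run each scorer judged better.
    # "with" = GrapeRoot run, "without" = baseline. Placeholder data.
    verdicts = [
        {"regex": "with", "llm": "with"},
        {"regex": "with", "llm": "without"},
        {"regex": "tie", "llm": "tie"},
    ]

    agree = sum(1 for v in verdicts if v["regex"] == v["llm"])
    print(f"scorer agreement: {agree / len(verdicts):.0%}")  # 2/3 -> 67%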

Pack Size vs Savings: correlation of context pack size with savings
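
The last chart is a correlation check: do larger context packs predict larger savings? The statistic behind such a chart is plausibly a Pearson correlation; a minimal sketch with placeholder numbers, not the benchmark's actual pack sizes:

    from statistics import correlation  # Python 3.10+

    # Placeholder pairs: context pack size (tokens) and per-prompt savings.
    pack_tokens = [1200, 3400, 5100, 8000]
    savings_pct = [0.12, 0.28, 0.35, 0.45]

    # Positive r means larger packs tended to produce larger savings.
    print(f"Pearson r = {correlation(pack_tokens, savings_pct):.2f}")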