Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · a real 92-file production codebase · the same model (Claude Sonnet 4.6) and the same questions, run with and without GrapeRoot.

45% cheaper on complex tasks (v3.8.35 challenge benchmark · 10/10 prompts)
10/10 cost + quality wins (a clean sweep of the challenge benchmark)
34% average savings on the E2E benchmark (16/20 cost wins · real-world multi-step prompts)
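
The headline numbers above are simple aggregates over paired runs: each prompt is executed once without GrapeRoot (baseline) and once with it. As a rough sketch of how "avg savings" and "cost wins" fall out of such paired per-prompt costs, here is a minimal Python illustration; the cost figures in it are made up for the example and are not the published benchmark data, and nothing here reflects GrapeRoot's actual harness:

```python
from statistics import mean

# Hypothetical paired per-prompt costs in USD: (baseline, with GrapeRoot).
# Illustrative values only; NOT the published benchmark numbers.
pairs = [
    (0.42, 0.25),
    (0.31, 0.22),
    (0.55, 0.33),
    (0.18, 0.19),  # a prompt where the baseline was cheaper (a cost loss)
    (0.47, 0.30),
]

# Per-prompt saving relative to the baseline run.
savings = [(base - gr) / base for base, gr in pairs]

# "Avg savings" is the mean per-prompt saving; a "cost win" is any prompt
# where the run with GrapeRoot was strictly cheaper than the baseline.
avg_savings = mean(savings)
wins = sum(1 for base, gr in pairs if gr < base)

print(f"avg savings: {avg_savings:.0%}")   # avg savings: 28%
print(f"cost wins: {wins}/{len(pairs)}")   # cost wins: 4/5
```

Under this reading, a "tie" (as in the Win Rate chart below) would be a prompt where both runs cost the same.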

Charts (interactive on the live page; titles and captions listed below):

- Cost Evolution (cost over version iterations)
- Category Cost (cost by task category)
- Quality Scores (quality across versions)
- Wall Time (latency comparison)
- Token Volume (input + output token usage)
- Cumulative Cost (running cost across prompts)
- Win Rate (win/loss/tie breakdown)
- Cost vs Quality (every prompt plotted)
- Version Delta (change vs prior version)
- Complex Prompts (hardest prompt subset)
- Efficiency Radar (multi-dimensional comparison)
- Challenge: Cost (cost on the challenge set)
- Challenge: Quality (quality on the challenge set)
- Challenge: Efficiency (efficiency on the challenge set)
- Challenge: Savings (savings on the challenge set)
- Full Cost Evolution (all versions, all costs)
- v3.8.39: 3-Way Cost (three-way cost comparison)
- v3.8.39: 3-Way Quality (three-way quality comparison)
- v3.8.39: Cost Improvement (cost delta from the prior version)
- v3.8.39: Win Rate (win rate for v3.8.39)
- v3.8.39: Full Evolution (full version cost evolution)
- v3.8.39: Before/After (before/after comparison)