Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · a real 92-file production codebase · the same model (Claude Sonnet 4.6) and the same questions, run with and without GrapeRoot.

45% cheaper on complex tasks (v3.8.35 challenge benchmark · 10/10 prompts)
10/10 cost + quality wins (clean sweep across the challenge benchmark)
34% average savings on the E2E benchmark (16/20 cost wins · real-world multi-step prompts)

Per-Task Savings (v3.8.35)

Cost reduction per task type — up to 81%

Quality Comparison (v3.8.35)

GrapeRoot vs Normal — quality held or improved on every prompt

Win Rate Evolution

How GrapeRoot improved across benchmark runs

E2E Savings Waterfall

Per-prompt savings across 20 real-world tasks

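For readers who want the arithmetic behind a waterfall like this: per-prompt savings are just the cost delta as a fraction of the baseline cost for the same prompt. A minimal Python sketch, using made-up cost pairs rather than the actual benchmark data:

```python
# Hypothetical (prompt, baseline_cost, graperoot_cost) pairs in USD.
# Illustrative values only, not the actual benchmark data.
runs = [
    ("refactor auth module", 0.42, 0.21),
    ("trace failing test", 0.18, 0.15),
    ("add API endpoint", 0.30, 0.19),
]

for prompt, baseline, treated in runs:
    # Per-prompt savings: cost delta as a fraction of the baseline.
    print(f"{prompt}: {(baseline - treated) / baseline:.0%} saved")

# Aggregate over totals so expensive prompts carry proportional weight,
# rather than averaging the per-prompt percentages.
total_base = sum(b for _, b, _ in runs)
total_treated = sum(t for _, _, t in runs)
print(f"overall: {(total_base - total_treated) / total_base:.0%} saved")
```
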
E2E Quality (Regex + LLM Judge)

Dual-scored quality — regex coverage + LLM judge

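As a sketch of what dual scoring can look like in practice: a deterministic regex pass checks that required facts show up in an answer, and an LLM judge supplies a holistic 0-1 rating; both are reported side by side. The pattern list, judge score, and equal weighting below are illustrative assumptions, not GrapeRoot's actual harness:

```python
import re

def regex_coverage(answer: str, patterns: list[str]) -> float:
    """Fraction of required regex patterns that match the answer."""
    hits = sum(bool(re.search(p, answer, re.IGNORECASE)) for p in patterns)
    return hits / len(patterns)

def dual_score(answer: str, patterns: list[str], judge_score: float) -> dict:
    """Report deterministic coverage next to a 0-1 LLM judge rating."""
    coverage = regex_coverage(answer, patterns)
    return {
        "regex_coverage": coverage,
        "llm_judge": judge_score,
        # Equal weighting is an assumption; any blend could be used.
        "combined": 0.5 * coverage + 0.5 * judge_score,
    }

# Hypothetical patterns a correct answer about a config loader might hit;
# the judge score would come from a separate LLM call in a real harness.
patterns = [r"load_config", r"\bYAML\b", r"environment variable"]
answer = "load_config reads YAML first, then environment variables."
print(dual_score(answer, patterns, judge_score=0.9))
```
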
Win Rate Grid

Win/loss/tie breakdown across all versions

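A grid like this reduces to a tally: for each version, compare every prompt's outcome against the baseline and count wins, losses, and ties. A small sketch over hypothetical outcome lists (version labels and results are invented for illustration):

```python
from collections import Counter

# Hypothetical per-prompt outcomes versus the no-GrapeRoot baseline.
results = {
    "v3.8.0":  ["win", "loss", "tie", "win", "loss"],
    "v3.8.35": ["win", "win", "win", "tie", "win"],
}

for version, outcomes in results.items():
    tally = Counter(outcomes)
    print(f"{version}: {tally['win']}W / {tally['loss']}L / {tally['tie']}T "
          f"({tally['win'] / len(outcomes):.0%} win rate)")
```
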
Efficiency Radar

Multi-dimensional efficiency comparison

Cost vs Quality

Every prompt plotted: lower cost, higher quality

Cost Evolution (All Versions)

How each architecture iteration reduced cost

Bias Analysis

Expected vs actual winners — fair benchmark

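One way to check a benchmark for tilt, sketched here with invented data: record which side you expect to win each prompt before running it, then compare those expectations against the actual winners and look for surprises piling up on one side:

```python
# Hypothetical pre-registered expectations vs. observed winners per prompt.
expected = ["graperoot", "normal", "graperoot", "tie", "graperoot"]
actual   = ["graperoot", "graperoot", "graperoot", "tie", "normal"]

matches = sum(e == a for e, a in zip(expected, actual))
print(f"expected/actual agreement: {matches}/{len(expected)}")

# Upsets split by direction: a fair prompt set should not see surprises
# land overwhelmingly on one side.
for_tool = sum(e != a and a == "graperoot" for e, a in zip(expected, actual))
against = sum(e != a and a == "normal" for e, a in zip(expected, actual))
print(f"upsets favoring GrapeRoot: {for_tool} · favoring baseline: {against}")
```
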
Turns & Wall Time (E2E)

Fewer turns, faster responses across 20 prompts
