Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · a real 92-file production codebase · the same model (Claude Sonnet 4.6) and the same questions, run with and without GrapeRoot.

45% cheaper on complex tasks (v3.8.35 challenge benchmark · 10/10 prompts)
10/10 cost + quality wins (clean sweep across the challenge benchmark)
34% avg savings on the E2E benchmark (16/20 cost wins · real-world multi-step prompts)

E2E Per-Prompt Cost

Cost per prompt, E2E benchmark


E2E Quality (Dual)

Regex + LLM judge dual scoring


3-Way Cost (CGC)

GrapeRoot vs Normal vs CGC cost


Quality by Level

Quality breakdown by difficulty level


Win Matrix

Win/loss matrix across configurations


E2E Savings Waterfall

Per-prompt savings, E2E run


Win Rate Evolution

Win rate improvement over time


E2E Turns

Turn count comparison


GR vs CGC Savings

GrapeRoot vs CGC savings delta


E2E Wall Time

Total wall time per prompt


Total Spend

Aggregate spend across all runs
