Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across the challenge benchmark
34% avg savings on the E2E benchmark · 16/20 cost wins · real-world multi-step prompts
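
To read the headline numbers: savings here is the relative cost reduction on the same prompt run with GrapeRoot versus without, averaged across the run. A minimal sketch of that arithmetic, with placeholder costs and field names rather than the benchmark's actual data:

    # Paired per-prompt costs in USD, same prompt run with and without
    # GrapeRoot. Values and field names here are illustrative only.
    runs = [
        {"baseline_usd": 0.42, "graperoot_usd": 0.27},
        {"baseline_usd": 0.95, "graperoot_usd": 0.51},
    ]

    def savings(baseline: float, treated: float) -> float:
        # Relative cost reduction: 0.34 means 34% cheaper.
        return (baseline - treated) / baseline

    per_prompt = [savings(r["baseline_usd"], r["graperoot_usd"]) for r in runs]
    print(f"avg savings: {sum(per_prompt) / len(per_prompt):.0%}")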

Per-Prompt Cost: cost per prompt comparison

Quality (Regex + LLM): dual-scored quality

Turns & Wall Time: fewer turns, faster responses

Savings Waterfall: per-prompt cost savings

Quality Radar: multi-axis quality comparison

Cost vs Quality: every prompt plotted as a cost-vs-quality scatter

Classifier Decisions: how the classifier routed prompts

Bias Analysis: expected vs actual winners

Scorer Agreement: regex vs LLM judge agreement rate
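
How to read the agreement rate: the share of prompts where the regex scorer and the LLM judge reach the same verdict. A minimal sketch of that computation, assuming each scorer emits a per-prompt winner (the verdict labels are hypothetical):

    # One entry per prompt: which run each scorer judged better.
    # "with" = GrapeRoot run, "without" = baseline. Placeholder data.
    verdicts = [
        {"regex": "with", "llm": "with"},
        {"regex": "with", "llm": "without"},
        {"regex": "tie", "llm": "tie"},
    ]

    agree = sum(1 for v in verdicts if v["regex"] == v["llm"])
    print(f"scorer agreement: {agree / len(verdicts):.0%}")  # 2/3 -> 67%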

Pack Size vs Savings: correlation of context pack size with savings
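
The last chart is a correlation check: do larger context packs predict larger savings? The statistic behind such a chart is plausibly a Pearson correlation; a minimal sketch with placeholder numbers, not the benchmark's actual pack sizes:

    from statistics import correlation  # Python 3.10+

    # Placeholder pairs: context pack size (tokens) and per-prompt savings.
    pack_tokens = [1200, 3400, 5100, 8000]
    savings_pct = [0.12, 0.28, 0.35, 0.45]

    # Positive r means larger packs tended to produce larger savings.
    print(f"Pearson r = {correlation(pack_tokens, savings_pct):.2f}")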