Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · a real 92-file production codebase · the same model (Claude Sonnet 4.6) and the same questions, run with and without GrapeRoot.

45% cheaper on complex tasks (v3.8.35 challenge benchmark · 10/10 prompts)
10/10 cost + quality wins (a clean sweep of the challenge benchmark)
34% average savings on the E2E benchmark (16/20 cost wins · real-world multi-step prompts)
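
The headline numbers above are simple aggregates over paired runs: each prompt is executed once without GrapeRoot (baseline) and once with it. As a rough sketch of how "avg savings" and "cost wins" fall out of such paired per-prompt costs, here is a minimal Python illustration; the cost figures in it are made up for the example and are not the published benchmark data, and nothing here reflects GrapeRoot's actual harness:

```python
from statistics import mean

# Hypothetical paired per-prompt costs in USD: (baseline, with GrapeRoot).
# Illustrative values only; NOT the published benchmark numbers.
pairs = [
    (0.42, 0.25),
    (0.31, 0.22),
    (0.55, 0.33),
    (0.18, 0.19),  # a prompt where the baseline was cheaper (a cost loss)
    (0.47, 0.30),
]

# Per-prompt saving relative to the baseline run.
savings = [(base - gr) / base for base, gr in pairs]

# "Avg savings" is the mean per-prompt saving; a "cost win" is any prompt
# where the run with GrapeRoot was strictly cheaper than the baseline.
avg_savings = mean(savings)
wins = sum(1 for base, gr in pairs if gr < base)

print(f"avg savings: {avg_savings:.0%}")   # avg savings: 28%
print(f"cost wins: {wins}/{len(pairs)}")   # cost wins: 4/5
```

Under this reading, a "tie" (as in the Win Rate chart below) would be a prompt where both runs cost the same.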

Charts (interactive on the live page; titles and captions listed below):

- Cost Evolution (cost over version iterations)
- Category Cost (cost by task category)
- Quality Scores (quality across versions)
- Wall Time (latency comparison)
- Token Volume (input + output token usage)
- Cumulative Cost (running cost across prompts)
- Win Rate (win/loss/tie breakdown)
- Cost vs Quality (every prompt plotted)
- Version Delta (change vs prior version)
- Complex Prompts (hardest prompt subset)
- Efficiency Radar (multi-dimensional comparison)
- Challenge: Cost (cost on the challenge set)
- Challenge: Quality (quality on the challenge set)
- Challenge: Efficiency (efficiency on the challenge set)
- Challenge: Savings (savings on the challenge set)
- Full Cost Evolution (all versions, all costs)
- v3.8.39: 3-Way Cost (three-way cost comparison)
- v3.8.39: 3-Way Quality (three-way quality comparison)
- v3.8.39: Cost Improvement (cost delta from the prior version)
- v3.8.39: Win Rate (win rate for v3.8.39)
- v3.8.39: Full Evolution (full version cost evolution)
- v3.8.39: Before/After (before/after comparison)