Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across challenge benchmark
34% avg savings (E2E benchmark) · 16/20 cost wins · real-world multi-step prompts

Sentry Python Benchmark

30 prompts · getsentry/sentry · ~7,762 Python files · Django + Celery + Snuba · GR v3 vs Normal · community challenge open

All costs = Anthropic API charges · Claude Sonnet 4.6 (main agent) + Claude Haiku 4.5 (subagents spawned by Normal mode)


getsentry/sentry

github.com/getsentry/sentry ↗

Open-source error tracking · Python/Django · ~7,762 Python files · Celery + Snuba · REST API + ORM + integrations

78.4/100 · GR v3 avg quality
78.6/100 · Normal avg quality
43% · API cost saved vs Normal
$13.25 · GR v3 API cost (30 prompts)

Costs = Anthropic API charges only (input + output + cache tokens). Normal mode spawns multiple Claude Haiku 4.5 subagents for search/grep operations — those API calls are included. GR v3 uses Claude Sonnet 4.6 exclusively. Same model, same prompts, same codebase.

Same quality. 43% cheaper in API costs.

At ~7,762 files, the largest codebase tested, GR v3 totalled $13.25 in Anthropic API charges vs Normal's $23.14: a 43% reduction at near-equal quality (78.4 vs 78.6/100). Normal's higher cost comes from the Haiku 4.5 subagents it spawns for bash/grep search passes; each prompt triggers multiple tool-use rounds that add up fast at this scale. Think your setup can beat GR v3? Run the prompts and submit.
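The cost gap is straightforward to reproduce from raw usage logs. A minimal sketch of the aggregation, where the per-million-token prices and the log field names are illustrative placeholders (not official Anthropic pricing, and not the benchmark runner's actual code):

```python
# Illustrative aggregation of per-mode API charges.
# PRICES are placeholder values, NOT official Anthropic pricing.
PRICES = {  # $ per million tokens: (input, output)
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5":  (1.00, 5.00),
}

def api_cost(events):
    """Sum charges over usage events: dicts with model/input_tokens/output_tokens."""
    total = 0.0
    for e in events:
        in_price, out_price = PRICES[e["model"]]
        total += e["input_tokens"] / 1e6 * in_price
        total += e["output_tokens"] / 1e6 * out_price
    return total

# GR v3: one Sonnet agent per prompt.  Normal: a Sonnet main agent
# plus several Haiku subagent calls -- those API calls count too.
gr_run = [{"model": "claude-sonnet-4-6", "input_tokens": 120_000, "output_tokens": 4_000}]
normal_run = (
    [{"model": "claude-sonnet-4-6", "input_tokens": 200_000, "output_tokens": 6_000}]
    + [{"model": "claude-haiku-4-5", "input_tokens": 80_000, "output_tokens": 2_000}] * 5
)
print(f"GR v3:  ${api_cost(gr_run):.2f}")
print(f"Normal: ${api_cost(normal_run):.2f}")
```

Even with cheap Haiku subagents, five extra search passes per prompt dominate the bill on a large codebase, which is the pattern the charts below show.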

Featured — Cumulative Cost

Cumulative Cost — Watch the Money Add Up

GR v3: $13.25 · Normal: $23.14 · 43% cheaper at equal quality · 30 prompts


All Charts

Summary: Quality & Cost

Avg quality and total cost — GR v3 vs Normal across 30 prompts


Per-Prompt Quality (all 30)

Line chart — every prompt quality score; MI region highlighted P21-P30


Quality by Category

GR v3 leads on architecture & code quality; Normal competitive on reliability


Total Cost — 30 prompts

GR v3: $13.25 vs Normal: $23.14 — 43% cheaper at equal quality


Quality vs Cost Scatter

Every prompt plotted — GR v3 clusters bottom-right (high Q, low cost)


Value Wins per Mode

Prompts where each mode had best Quality÷Cost ratio


Avg Turns per Prompt

GR v3 uses fewest turns — efficient retrieval means less back-and-forth


Cost vs Agent Turns

Turn count vs cost — Normal has high variance; GR stays predictable


Category Quality Radar

Polar chart across 5 categories — GR v3 leads on architecture, security


Cost Efficiency (Q/$)

Quality per dollar — GR v3 at 1.78 Q/$ vs Normal at 1.02 Q/$

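The Q/$ figures follow from the headline numbers: quality expressed as a 0-1 fraction divided by average API cost per prompt. A sketch consistent with the stats above (a reconstruction, not the runner's exact formula):

```python
def quality_per_dollar(avg_quality: float, total_cost: float, n_prompts: int) -> float:
    """Quality as a 0-1 fraction divided by average API cost per prompt."""
    return round((avg_quality / 100) / (total_cost / n_prompts), 2)

# Headline figures from the Sentry benchmark above
gr_v3  = quality_per_dollar(78.4, 13.25, 30)  # → 1.78
normal = quality_per_dollar(78.6, 23.14, 30)  # → 1.02
```

Same numerator to within 0.2 quality points, so the ratio gap is almost entirely the cost denominator.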

Community Challenge — Beat GR v3

GR v3 scored 78.4/100 avg quality at $13.25 total on getsentry/sentry (~7,762 Python files). Run the same 30 prompts with your own setup — any tool, any MCP, any CLAUDE.md — and submit. Beat us on quality, cost, or both.

Submit Results →

Methodology

run_sentry_4way.py — benchmark runner (simplified, Python)

```python
# 2 isolated copies of getsentry/sentry (~7,762 Python files)
# Each mode runs every prompt independently, no shared state

MODES = {
    "gr-v3": {
        "mcp":   "graperoot-v3",
        "model": "claude-sonnet-4-6",   # single agent, no subagents
    },
    "normal": {
        "mcp":   None,                  # bash + grep tools only
        "model": "claude-sonnet-4-6",   # main agent (Sonnet 4.6)
        # ↳ spawns claude-haiku-4-5 subagents for search operations
        #   those API calls are counted in the total cost
    },
}

# All $ figures = Anthropic API charges (input + output + cache tokens)
# Cost difference is real: Normal spawns 3–8 Haiku subagents per prompt

# Scoring: 5-dimension LLM judge (Claude Sonnet 4.6)
# findings_accuracy /20 · coverage_breadth /19 · depth_quality /15
# fix_completeness  /12 · actionability /8 · total /74 (mapped 0-100)

# Prompts 1-20: targeted · Prompts 21-30: multi-intent (labelled MI)
```
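The five rubric dimensions sum to 74 raw points, which the judge maps onto the 0-100 scale reported above. A sketch of that normalization, assuming the obvious linear rescale (the dimension names come from the rubric comment; the mapping itself is an assumption, not lifted from the runner):

```python
# Rubric maxima from the judge's 5-dimension scoring scheme
RUBRIC_MAX = {
    "findings_accuracy": 20,
    "coverage_breadth":  19,
    "depth_quality":     15,
    "fix_completeness":  12,
    "actionability":      8,
}  # sums to 74

def to_100(scores: dict) -> float:
    """Linearly rescale a raw /74 judge score to the 0-100 scale."""
    raw = sum(scores.values())
    max_raw = sum(RUBRIC_MAX.values())  # 74
    return round(raw / max_raw * 100, 1)

to_100(RUBRIC_MAX)  # → 100.0 (perfect score)
to_100({"findings_accuracy": 16, "coverage_breadth": 15,
        "depth_quality": 11, "fix_completeness": 9,
        "actionability": 7})  # → 78.4
```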

Benchmark run March 2026 · getsentry/sentry (~7,762 Python files) · 30 prompts · GR v3: Claude Sonnet 4.6 only · Normal: Sonnet 4.6 main + Haiku 4.5 subagents (all API costs included) · LLM judge: Claude Sonnet 4.6 · prompts + challenge at github.com/kunal12203/graperoot-benchmark-challenge.