Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across challenge benchmark
34% avg savings, E2E benchmark · 16/20 cost wins · real-world multi-step prompts

GrapeRoot vs jCodeMunch

10 real-world coding tasks · one production codebase · two AI coding tools · no shortcuts

📝

ColabNotes

github.com/kunal12203/collabnotes-benchmark

Notion-like collaborative notes app · Node.js + Express + TypeScript · Prisma + PostgreSQL + Redis · React frontend · ~197 files · 60 benchmark steps applied

84.7/100 · GrapeRoot avg
34.5/100 · jCodeMunch avg
8–2 · GrapeRoot wins
6/10 steps · jCM hallucinated

Step-by-step scorecard

Task                  Cat.    jCM     GR
Fix N+1 Queries       Perf     82     80
Add DB Indexes        Perf     15     78
Redis Cache           Perf     10     88
Cursor Pagination     Perf     10     86
Helmet + CSP          Sec      10     85
CSRF Protection       Sec      15     88
XSS Sanitization      Sec      10     90
SQL Injection Audit   Sec      20     78
Auth Service Tests    Test     88     87
Block Service Tests   Test     85     87
Total / Avg                  34.5   84.7
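The headline numbers can be re-derived from the scorecard itself; a quick sanity check over the scores transcribed from the table above:

```python
# (task, jCM score, GR score) — transcribed from the scorecard above
scores = [
    ("Fix N+1 Queries", 82, 80),
    ("Add DB Indexes", 15, 78),
    ("Redis Cache", 10, 88),
    ("Cursor Pagination", 10, 86),
    ("Helmet + CSP", 10, 85),
    ("CSRF Protection", 15, 88),
    ("XSS Sanitization", 10, 90),
    ("SQL Injection Audit", 20, 78),
    ("Auth Service Tests", 88, 87),
    ("Block Service Tests", 85, 87),
]

jcm_avg = sum(j for _, j, _ in scores) / len(scores)  # 34.5
gr_avg = sum(g for _, _, g in scores) / len(scores)   # 84.7
gr_wins = sum(g > j for _, j, g in scores)            # 8 of 10
print(jcm_avg, gr_avg, f"{gr_wins}-{len(scores) - gr_wins}")
```

The 8–2 tally and both averages fall straight out of the per-task scores.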

jCodeMunch wrote 35 files · +5,121 lines

4,960 lines from step 59 alone (test-writing). Steps 52–58: 1 line total.

GrapeRoot wrote 91 files · +8,375 lines

Across all 10 tasks — including 866 for Redis, 3,095 for XSS, 641 for CSP.

By category

Performance

jCodeMunch avg 29.3/100 · GrapeRoot avg 83/100
jCM cost $0.99 · GR cost $2.14

Security

jCodeMunch avg 13.75/100 · GrapeRoot avg 85.25/100
jCM cost $0.82 · GR cost $2.85

Testing

jCodeMunch avg 86.5/100 · GrapeRoot avg 87/100
jCM cost $1.61 · GR cost $1.80
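Dividing each category's cost by its average score (numbers from the three cards above) gives a rough cost-per-quality-point view — a quick sketch:

```python
# (category, jCM avg score, jCM cost, GR avg score, GR cost) — from the cards above
cats = [
    ("Performance", 29.3, 0.99, 83.0, 2.14),
    ("Security", 13.75, 0.82, 85.25, 2.85),
    ("Testing", 86.5, 1.61, 87.0, 1.80),
]

for name, jcm_score, jcm_cost, gr_score, gr_cost in cats:
    # Dollars spent per quality point earned in each category
    print(f"{name}: jCM ${jcm_cost / jcm_score:.3f}/pt · GR ${gr_cost / gr_score:.3f}/pt")
```

GrapeRoot spends more in absolute dollars but buys each quality point more cheaply on Performance and Security; on Testing, jCodeMunch is cheaper per point — consistent with the pure test-writing result in "When jCodeMunch is better".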

Core finding

BM25 finds symbols that exist. Graphs know what doesn't.

jCodeMunch uses tree-sitter AST parsing + BM25 keyword search — not embeddings. It indexes every symbol, generates LLM summaries, and retrieves via token-frequency scoring. BM25 on symbol names is actually excellent for code: csrfProtection matches csrf reliably through tokenization.
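The tokenization claim is easy to illustrate. A minimal sketch — a hypothetical camelCase splitter with plain token overlap, not jCodeMunch's actual indexer or a full BM25 scorer — showing why `csrfProtection` matches a `csrf` query at the token level:

```python
import re

def tokenize(symbol: str) -> list[str]:
    # Split camelCase / acronym / digit runs into lowercase tokens
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", symbol)
    return [p.lower() for p in parts]

symbols = ["csrfProtection", "sanitizeHtml", "createNote"]
index = {s: tokenize(s) for s in symbols}

def search(query: str) -> list[str]:
    # BM25 proper weights matches by term frequency and rarity; bare
    # token overlap is enough to show why the match fires at all.
    q = set(tokenize(query))
    return [s for s, toks in index.items() if q & set(toks)]

print(tokenize("csrfProtection"))  # ['csrf', 'protection']
print(search("csrf"))              # ['csrfProtection']
```

Once identifiers are split this way, a one-word query reliably reaches every symbol that mentions the concept — which is exactly the strength, and the trap, described next.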

The problem isn't the retrieval — it's what it retrieves. BM25 finds symbols that exist. If the codebase has any Redis-related symbol, search_symbols("redis cache") returns it, and Claude concludes the feature is implemented. GrapeRoot's symbol graph tracks the full call chain — if redisGet isn't in the symbol table, the cache layer doesn't exist.
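The distinction can be sketched abstractly (toy data structures, not either tool's real internals): keyword search answers "does anything match?", while a call graph can answer "is the chain actually wired up?":

```python
# Hypothetical toy index — illustrates the failure mode, not either tool's internals.
symbol_table = {"redisClient", "cacheConfig"}          # Redis-adjacent names exist...
call_graph = {"getNote": ["prisma.note.findUnique"]}   # ...but no route calls a cache

def bm25_style_check(query_tokens: set[str]) -> bool:
    # "Does any symbol mention redis?" — yes, so an LLM may conclude
    # the cache layer is already implemented.
    return any(t in s.lower() for s in symbol_table for t in query_tokens)

def graph_style_check(entry: str, required: str) -> bool:
    # "Does the call chain from this route ever reach the cache?"
    # Walk the graph; absence of the edge means the feature doesn't exist.
    seen, stack = set(), [entry]
    while stack:
        fn = stack.pop()
        if fn == required:
            return True
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(call_graph.get(fn, []))
    return False

print(bm25_style_check({"redis"}))               # True  — a symbol exists
print(graph_style_check("getNote", "redisGet"))  # False — never called
```

Both checks are cheap; only the second one can return a trustworthy "no".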

Step 53 (Redis cache)

jCM: "The Redis cache layer is already fully implemented…"

GrapeRoot: wrote 866 lines across 6 files

Step 56 (CSRF)

jCM: "Everything requested is already fully implemented…" (1 line changed)

GrapeRoot: built HMAC token + double verification + frontend across 12 files

Step 57 (XSS)

jCM: "Everything is already implemented…"

GrapeRoot: replaced library + Zod transform + 21 unit tests across 11 files

When jCodeMunch is better

On pure test-writing tasks (steps 59–60), both tools scored ~87 and jCodeMunch was cheaper. Generative tasks that don't require understanding of what's absent from the codebase favour AST + BM25 — symbol retrieval is fast and accurate when you just need to find and read existing code.

Benchmark runner code

runner.py — MCP config per mode (Python · simplified)
# Graperoot: dual-graph HTTP server
mcp_config = json.dumps({"mcpServers": {"graperoot": {
    "type": "http", "url": f"http://127.0.0.1:{port}/mcp"
}}})

# jCodeMunch: SSE server
mcp_config = json.dumps({"mcpServers": {"jcodemunch": {
    "type": "sse", "url": f"http://127.0.0.1:{port}/sse"
}}})

# Normal mode: zero MCP access
mcp_config = json.dumps({"mcpServers": {}})
runner.py — running a step (Python · simplified)
def run_claude(prompt: str, project_dir: Path, mcp_config: str) -> dict:
    r = subprocess.run(
        ["claude", "--dangerously-skip-permissions",
         "--output-format", "json",
         "--model", "claude-sonnet-4-6",
         "--strict-mcp-config", "--mcp-config", mcp_config,
         "-p", prompt],
        capture_output=True, text=True, cwd=str(project_dir),
    )
    data = json.loads(r.stdout)
    return {
        "response_text":  data.get("result", ""),
        "num_turns":      data.get("num_turns", 0),
        "total_cost_usd": data.get("total_cost_usd", 0),
    }
runner.py — LLM-as-judge scoring (Python · simplified)
def llm_judge(prompt_text: str, response_text: str, diff: dict) -> dict:
    judge_prompt = f"""Score this AI coding response 0-100.
TASK: {prompt_text[:800]}
RESPONSE: {response_text[:1500]}
CHANGES: +{diff['insertions']}/-{diff['deletions']} lines

80-100: Fully completed · 60-79: Mostly complete
40-59: Partial · 20-39: Wrong approach · 0-19: Failed

Reply ONLY: {{"score": <0-100>, "reason": "<one sentence>"}}"""

    r = subprocess.run(
        ["claude", "--model", "claude-haiku-4-5-20251001",
         "--strict-mcp-config", "--mcp-config", "{}",
         "-p", judge_prompt],
        cwd="/tmp",  # no CLAUDE.md — judge is fully isolated
        capture_output=True, text=True, timeout=120,
    )
    return json.loads(json.loads(r.stdout)["result"])
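The judge consumes a `diff` dict with `insertions`/`deletions` counts. One way such a dict could be collected — a hypothetical helper parsing `git diff --shortstat`, not necessarily how runner.py does it:

```python
import re
import subprocess

def parse_shortstat(out: str) -> dict:
    # Parse `git diff --shortstat` output, e.g.
    # " 3 files changed, 10 insertions(+), 2 deletions(-)"
    ins = re.search(r"(\d+) insertion", out)
    dels = re.search(r"(\d+) deletion", out)
    return {
        "insertions": int(ins.group(1)) if ins else 0,
        "deletions": int(dels.group(1)) if dels else 0,
    }

def diff_stats(project_dir: str) -> dict:
    # Hypothetical helper: summarize uncommitted changes in the
    # isolated project copy into the shape llm_judge expects.
    out = subprocess.run(
        ["git", "diff", "--shortstat"],
        capture_output=True, text=True, cwd=project_dir,
    ).stdout
    return parse_shortstat(out)
```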

Benchmark run March 2026 · Claude Sonnet 4.6 · 10 tasks on ColabNotes (~188 files, TypeScript) · LLM-as-judge via Claude Haiku 4.5 · isolated project copies from identical starting state.