Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across challenge benchmark
34% avg savings, E2E benchmark · 16/20 cost wins · real-world multi-step prompts

GrapeRoot vs jCodeMunch

10 real-world coding tasks · one production codebase · two AI coding tools · no shortcuts

📝

ColabNotes

github.com/kunal12203/collabnotes-benchmark

Notion-like collaborative notes app · Node.js + Express + TypeScript · Prisma + PostgreSQL + Redis · React frontend · ~197 files · 60 benchmark steps applied

84.7/100 · GrapeRoot avg
34.5/100 · jCodeMunch avg
8–2 · GrapeRoot wins
6/10 steps · jCM hallucinated

Step-by-step scorecard

Task                  Cat.    jCM     GR
Fix N+1 Queries       Perf     82     80
Add DB Indexes        Perf     15     78
Redis Cache           Perf     10     88
Cursor Pagination     Perf     10     86
Helmet + CSP          Sec      10     85
CSRF Protection       Sec      15     88
XSS Sanitization      Sec      10     90
SQL Injection Audit   Sec      20     78
Auth Service Tests    Test     88     87
Block Service Tests   Test     85     87
Total / Avg                  34.5   84.7
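The headline numbers can be re-derived from the scorecard itself; a quick sanity check over the scores transcribed from the table above:

```python
# (task, jCM score, GR score) — transcribed from the scorecard above
scores = [
    ("Fix N+1 Queries", 82, 80),
    ("Add DB Indexes", 15, 78),
    ("Redis Cache", 10, 88),
    ("Cursor Pagination", 10, 86),
    ("Helmet + CSP", 10, 85),
    ("CSRF Protection", 15, 88),
    ("XSS Sanitization", 10, 90),
    ("SQL Injection Audit", 20, 78),
    ("Auth Service Tests", 88, 87),
    ("Block Service Tests", 85, 87),
]

jcm_avg = sum(j for _, j, _ in scores) / len(scores)  # 34.5
gr_avg = sum(g for _, _, g in scores) / len(scores)   # 84.7
gr_wins = sum(g > j for _, j, g in scores)            # 8 of 10
print(jcm_avg, gr_avg, f"{gr_wins}-{len(scores) - gr_wins}")
```

The 8–2 tally and both averages fall straight out of the per-task scores.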

jCodeMunch wrote 35 files · +5,121 lines

4,960 lines from step 59 alone (test-writing). Steps 52–58: 1 line total.

GrapeRoot wrote 91 files · +8,375 lines

Across all 10 tasks — including 866 for Redis, 3,095 for XSS, 641 for CSP.

By category

Performance

jCodeMunch avg 29.3/100 · GrapeRoot avg 83/100
jCM cost $0.99 · GR cost $2.14

Security

jCodeMunch avg 13.75/100 · GrapeRoot avg 85.25/100
jCM cost $0.82 · GR cost $2.85

Testing

jCodeMunch avg 86.5/100 · GrapeRoot avg 87/100
jCM cost $1.61 · GR cost $1.80
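Dividing each category's cost by its average score (numbers from the three cards above) gives a rough cost-per-quality-point view — a quick sketch:

```python
# (category, jCM avg score, jCM cost, GR avg score, GR cost) — from the cards above
cats = [
    ("Performance", 29.3, 0.99, 83.0, 2.14),
    ("Security", 13.75, 0.82, 85.25, 2.85),
    ("Testing", 86.5, 1.61, 87.0, 1.80),
]

for name, jcm_score, jcm_cost, gr_score, gr_cost in cats:
    # Dollars spent per quality point earned in each category
    print(f"{name}: jCM ${jcm_cost / jcm_score:.3f}/pt · GR ${gr_cost / gr_score:.3f}/pt")
```

GrapeRoot spends more in absolute dollars but buys each quality point more cheaply on Performance and Security; on Testing, jCodeMunch is cheaper per point — consistent with the pure test-writing result in "When jCodeMunch is better".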

Core finding

BM25 finds symbols that exist. Graphs know what doesn't.

jCodeMunch uses tree-sitter AST parsing + BM25 keyword search — not embeddings. It indexes every symbol, generates LLM summaries, and retrieves via token-frequency scoring. BM25 on symbol names is actually excellent for code: csrfProtection matches csrf reliably through tokenization.
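The tokenization claim is easy to illustrate. A minimal sketch — a hypothetical camelCase splitter with plain token overlap, not jCodeMunch's actual indexer or a full BM25 scorer — showing why `csrfProtection` matches a `csrf` query at the token level:

```python
import re

def tokenize(symbol: str) -> list[str]:
    # Split camelCase / acronym / digit runs into lowercase tokens
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", symbol)
    return [p.lower() for p in parts]

symbols = ["csrfProtection", "sanitizeHtml", "createNote"]
index = {s: tokenize(s) for s in symbols}

def search(query: str) -> list[str]:
    # BM25 proper weights matches by term frequency and rarity; bare
    # token overlap is enough to show why the match fires at all.
    q = set(tokenize(query))
    return [s for s, toks in index.items() if q & set(toks)]

print(tokenize("csrfProtection"))  # ['csrf', 'protection']
print(search("csrf"))              # ['csrfProtection']
```

Once identifiers are split this way, a one-word query reliably reaches every symbol that mentions the concept — which is exactly the strength, and the trap, described next.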

The problem isn't the retrieval — it's what it retrieves. BM25 finds symbols that exist. If the codebase has any Redis-related symbol, search_symbols("redis cache") returns it, and Claude concludes the feature is implemented. GrapeRoot's symbol graph tracks the full call chain — if redisGet isn't in the symbol table, the cache layer doesn't exist.
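The distinction can be sketched abstractly (toy data structures, not either tool's real internals): keyword search answers "does anything match?", while a call graph can answer "is the chain actually wired up?":

```python
# Hypothetical toy index — illustrates the failure mode, not either tool's internals.
symbol_table = {"redisClient", "cacheConfig"}          # Redis-adjacent names exist...
call_graph = {"getNote": ["prisma.note.findUnique"]}   # ...but no route calls a cache

def bm25_style_check(query_tokens: set[str]) -> bool:
    # "Does any symbol mention redis?" — yes, so an LLM may conclude
    # the cache layer is already implemented.
    return any(t in s.lower() for s in symbol_table for t in query_tokens)

def graph_style_check(entry: str, required: str) -> bool:
    # "Does the call chain from this route ever reach the cache?"
    # Walk the graph; absence of the edge means the feature doesn't exist.
    seen, stack = set(), [entry]
    while stack:
        fn = stack.pop()
        if fn == required:
            return True
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(call_graph.get(fn, []))
    return False

print(bm25_style_check({"redis"}))               # True  — a symbol exists
print(graph_style_check("getNote", "redisGet"))  # False — never called
```

Both checks are cheap; only the second one can return a trustworthy "no".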

Step 53 (Redis cache)

jCM: "The Redis cache layer is already fully implemented…"

GrapeRoot: wrote 866 lines across 6 files

Step 56 (CSRF)

jCM: "Everything requested is already fully implemented…" (1 line changed)

GrapeRoot: built HMAC token + double verification + frontend across 12 files

Step 57 (XSS)

jCM: "Everything is already implemented…"

GrapeRoot: replaced library + Zod transform + 21 unit tests across 11 files

When jCodeMunch is better

On pure test-writing tasks (steps 59–60), both tools scored ~87 and jCodeMunch was cheaper. Generative tasks that don't require understanding of what's absent from the codebase favour AST + BM25 — symbol retrieval is fast and accurate when you just need to find and read existing code.

Benchmark runner code

runner.py — MCP config per mode (Python · simplified)
# Graperoot: dual-graph HTTP server
mcp_config = json.dumps({"mcpServers": {"graperoot": {
    "type": "http", "url": f"http://127.0.0.1:{port}/mcp"
}}})

# jCodeMunch: SSE server
mcp_config = json.dumps({"mcpServers": {"jcodemunch": {
    "type": "sse", "url": f"http://127.0.0.1:{port}/sse"
}}})

# Normal mode: zero MCP access
mcp_config = json.dumps({"mcpServers": {}})
runner.py — running a step (Python · simplified)
def run_claude(prompt: str, project_dir: Path, mcp_config: str) -> dict:
    r = subprocess.run(
        ["claude", "--dangerously-skip-permissions",
         "--output-format", "json",
         "--model", "claude-sonnet-4-6",
         "--strict-mcp-config", "--mcp-config", mcp_config,
         "-p", prompt],
        capture_output=True, text=True, cwd=str(project_dir),
    )
    data = json.loads(r.stdout)
    return {
        "response_text":  data.get("result", ""),
        "num_turns":      data.get("num_turns", 0),
        "total_cost_usd": data.get("total_cost_usd", 0),
    }
runner.py — LLM-as-judge scoring (Python · simplified)
def llm_judge(prompt_text: str, response_text: str, diff: dict) -> dict:
    judge_prompt = f"""Score this AI coding response 0-100.
TASK: {prompt_text[:800]}
RESPONSE: {response_text[:1500]}
CHANGES: +{diff['insertions']}/-{diff['deletions']} lines

80-100: Fully completed · 60-79: Mostly complete
40-59: Partial · 20-39: Wrong approach · 0-19: Failed

Reply ONLY: {{"score": <0-100>, "reason": "<one sentence>"}}"""

    r = subprocess.run(
        ["claude", "--model", "claude-haiku-4-5-20251001",
         "--strict-mcp-config", "--mcp-config", "{}",
         "-p", judge_prompt],
        cwd="/tmp",  # no CLAUDE.md — judge is fully isolated
        capture_output=True, text=True, timeout=120,
    )
    return json.loads(json.loads(r.stdout)["result"])
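The judge consumes a `diff` dict with `insertions`/`deletions` counts. One way such a dict could be collected — a hypothetical helper parsing `git diff --shortstat`, not necessarily how runner.py does it:

```python
import re
import subprocess

def parse_shortstat(out: str) -> dict:
    # Parse `git diff --shortstat` output, e.g.
    # " 3 files changed, 10 insertions(+), 2 deletions(-)"
    ins = re.search(r"(\d+) insertion", out)
    dels = re.search(r"(\d+) deletion", out)
    return {
        "insertions": int(ins.group(1)) if ins else 0,
        "deletions": int(dels.group(1)) if dels else 0,
    }

def diff_stats(project_dir: str) -> dict:
    # Hypothetical helper: summarize uncommitted changes in the
    # isolated project copy into the shape llm_judge expects.
    out = subprocess.run(
        ["git", "diff", "--shortstat"],
        capture_output=True, text=True, cwd=project_dir,
    ).stdout
    return parse_shortstat(out)
```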

Benchmark run March 2026 · Claude Sonnet 4.6 · 10 tasks on ColabNotes (~188 files, TypeScript) · LLM-as-judge via Claude Haiku 4.5 · isolated project copies from identical starting state.