Benchmarks
Does it actually work? We ran the numbers.
5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.
GrapeRoot vs jCodeMunch
10 real-world coding tasks · one production codebase · two AI coding tools · no shortcuts
ColabNotes
github.com/kunal12203/collabnotes-benchmark
Notion-like collaborative notes app · Node.js + Express + TypeScript · Prisma + PostgreSQL + Redis · React frontend · ~197 files · 60 benchmark steps applied
84.7/100
GrapeRoot avg
34.5/100
jCodeMunch avg
8–2
GrapeRoot wins
6/10 steps
jCM hallucinated
Step-by-step scorecard
| Task | Cat. | jCM | GR |
|---|---|---|---|
| Fix N+1 Queries | Perf | 82↑ | 80 |
| Add DB Indexes | Perf | 15 | 78↑ |
| Redis Cache | Perf | 10 | 88↑ |
| Cursor Pagination | Perf | 10 | 86↑ |
| Helmet + CSP | Secu | 10 | 85↑ |
| CSRF Protection | Secu | 15 | 88↑ |
| XSS Sanitization | Secu | 10 | 90↑ |
| SQL Injection Audit | Secu | 20 | 78↑ |
| Auth Service Tests | Test | 88↑ | 87 |
| Block Service Tests | Test | 85 | 87↑ |
| Total / Avg | | 34.5 | 84.7 |
jCodeMunch wrote
35 files
+5,121 lines
4,960 lines from step 59 alone (test-writing). Steps 52–58: 1 line total.
GrapeRoot wrote
91 files
+8,375 lines
Across all 10 tasks — including 866 for Redis, 3,095 for XSS, 641 for CSP.
By category: Performance · Security · Testing
Core finding
BM25 finds symbols that exist. Graphs know what doesn't.
jCodeMunch uses tree-sitter AST parsing + BM25 keyword search — not embeddings. It indexes every symbol, generates LLM summaries, and retrieves via token-frequency scoring. BM25 on symbol names is actually excellent for code: csrfProtection matches csrf reliably through tokenization.
The problem isn't the retrieval — it's what it retrieves. BM25 finds symbols that exist. If the codebase has any Redis-related symbol, search_symbols("redis cache") returns it, and Claude concludes the feature is implemented. GrapeRoot's symbol graph tracks the full call chain — if redisGet isn't in the symbol table, the cache layer doesn't exist.
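The distinction can be sketched in a few lines of Python. This is illustrative only (hypothetical symbol data, not either tool's real API): keyword retrieval reports *presence*, while a symbol/call graph can assert *absence*.

```python
import re

# Hypothetical index of a codebase that configures a Redis *client*
# but never builds a cache layer on top of it.
symbols = {"RedisClient", "redisConfig", "getUser"}
call_graph = {"getUser": ["prisma.user.findUnique"]}  # caller -> callees

def split_tokens(name: str) -> set[str]:
    """Split camelCase identifiers into lowercase tokens."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", name)}

def keyword_search(query: str) -> list[str]:
    """Crude token-overlap retrieval standing in for BM25 scoring."""
    q = set(query.lower().split())
    return sorted(s for s in symbols if q & split_tokens(s))

# Keyword search: any Redis-adjacent symbol matches, so a model reading
# the results can conclude the cache "is already implemented".
print(keyword_search("redis cache"))  # ['RedisClient', 'redisConfig']

def cache_layer_exists() -> bool:
    """Graph check: the cache is real only if something calls into it."""
    return any("redisGet" in callees for callees in call_graph.values())

print(cache_layer_exists())  # False: no call edge, no cache layer
```

The point is not the scoring function; it is that an index of what exists cannot, by itself, answer whether a feature is missing.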
Step 53 (Redis cache)
jCM: "The Redis cache layer is already fully implemented…"
GrapeRoot: wrote 866 lines across 6 files
Step 56 (CSRF)
jCM: "Everything requested is already fully implemented…" (1 line changed)
GrapeRoot: built HMAC token + double verification + frontend across 12 files
Step 57 (XSS)
jCM: "Everything is already implemented…"
GrapeRoot: replaced library + Zod transform + 21 unit tests across 11 files
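For context on what a passing Step 56 entails, here is a minimal sketch of an HMAC-based, session-bound CSRF token in Python. The benchmark's actual implementation is TypeScript/Express and is not reproduced here; the names and the hard-coded secret are illustrative.

```python
import hashlib
import hmac
import secrets

SECRET = b"server-side-secret"  # illustrative; a real app loads this from config

def issue_csrf_token(session_id: str) -> str:
    """Token = nonce + HMAC(secret, session_id|nonce), bound to the session."""
    nonce = secrets.token_hex(16)
    sig = hmac.new(SECRET, f"{session_id}|{nonce}".encode(), hashlib.sha256).hexdigest()
    return f"{nonce}.{sig}"

def verify_csrf_token(session_id: str, token: str) -> bool:
    """Recompute the MAC and compare in constant time."""
    try:
        nonce, sig = token.split(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{session_id}|{nonce}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Binding the token to the session ID is what makes "double verification" meaningful: a token stolen from one session fails validation in any other.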
When jCodeMunch is better
On pure test-writing tasks (steps 59–60), both tools scored ~87 and jCodeMunch was cheaper. Generative tasks that don't require understanding of what's absent from the codebase favour AST + BM25 — symbol retrieval is fast and accurate when you just need to find and read existing code.
Benchmark runner code
```python
import json
import subprocess
from pathlib import Path

# GrapeRoot: dual-graph HTTP server
mcp_config = json.dumps({"mcpServers": {"graperoot": {
    "type": "http", "url": f"http://127.0.0.1:{port}/mcp"
}}})

# jCodeMunch: SSE server
mcp_config = json.dumps({"mcpServers": {"jcodemunch": {
    "type": "sse", "url": f"http://127.0.0.1:{port}/sse"
}}})

# Normal mode: zero MCP access
mcp_config = json.dumps({"mcpServers": {}})

def run_claude(prompt: str, project_dir: Path, mcp_config: str) -> dict:
    r = subprocess.run(
        ["claude", "--dangerously-skip-permissions",
         "--output-format", "json",
         "--model", "claude-sonnet-4-6",
         "--strict-mcp-config", "--mcp-config", mcp_config,
         "-p", prompt],
        capture_output=True, text=True, cwd=str(project_dir),
    )
    data = json.loads(r.stdout)
    return {
        "response_text": data.get("result", ""),
        "num_turns": data.get("num_turns", 0),
        "total_cost_usd": data.get("total_cost_usd", 0),
    }

def llm_judge(prompt_text: str, response_text: str, diff: dict) -> dict:
    judge_prompt = f"""Score this AI coding response 0-100.
TASK: {prompt_text[:800]}
RESPONSE: {response_text[:1500]}
CHANGES: +{diff['insertions']}/-{diff['deletions']} lines
80-100: Fully completed · 60-79: Mostly complete
40-59: Partial · 20-39: Wrong approach · 0-19: Failed
Reply ONLY: {{"score": <0-100>, "reason": "<one sentence>"}}"""
    r = subprocess.run(
        ["claude", "--model", "claude-haiku-4-5-20251001",
         "--strict-mcp-config", "--mcp-config", "{}",
         "-p", judge_prompt],
        cwd="/tmp",  # no CLAUDE.md — judge is fully isolated
        capture_output=True, text=True, timeout=120,
    )
    return json.loads(json.loads(r.stdout)["result"])
```

Benchmark run March 2026 · Claude Sonnet 4.6 · 10 tasks on ColabNotes (~188 files, TypeScript) · LLM-as-judge via Claude Haiku 4.5 · isolated project copies from identical starting state.
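The `diff` dict consumed by `llm_judge` is not shown above. One plausible way to produce it, assuming each task runs in its own git checkout, is to sum the per-file counts from `git diff --numstat` (a sketch, not the benchmark's actual code):

```python
import subprocess

def diff_stats(project_dir: str) -> dict:
    """Total insertions/deletions in the working tree vs HEAD.

    `git diff --numstat` prints 'added<TAB>removed<TAB>path' per file;
    binary files report '-' in the count columns and are skipped.
    """
    out = subprocess.run(
        ["git", "diff", "--numstat"],
        capture_output=True, text=True, cwd=project_dir, check=True,
    ).stdout
    insertions = deletions = 0
    for line in out.splitlines():
        added, removed, _path = line.split("\t", 2)
        if added != "-":  # skip binary files
            insertions += int(added)
            deletions += int(removed)
    return {"insertions": insertions, "deletions": deletions}
```

Using `--numstat` rather than parsing `--shortstat` prose keeps the output machine-readable and locale-independent.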