Benchmarks
Does it actually work? We ran the numbers.
5 benchmark runs · 80+ prompts · real 278-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.
GrapeRoot vs code-review-graph vs jCodeMunch
10 production tasks · one codebase · three graph-based AI tools · proactive vs reactive context
CollabNotes
github.com/kunal12203/collabnotes-benchmark
Notion-like collaborative notes app · Node.js + Express + TypeScript · Prisma + PostgreSQL + Redis · React frontend · ~197 files · 60 benchmark steps applied
84.7/100
GrapeRoot avg
71.5/100
code-review-graph avg
34.5/100
jCodeMunch avg
10–0 vs CRG
GrapeRoot wins
Step-by-step scorecard
| Task | Cat. | CRG | GR |
|---|---|---|---|
| Fix N+1 Queries | Perf | 70 | 80↑ |
| Add DB Indexes | Perf | 72 | 78↑ |
| Redis Cache | Perf | 76 | 88↑ |
| Cursor Pagination | Perf | 72 | 86↑ |
| Helmet + CSP | Secu | 78⚠ | 85↑ |
| CSRF Protection | Secu | 62⚠ | 88↑ |
| XSS Sanitization | Secu | 72 | 90↑ |
| SQL Injection Audit | Secu | 65 | 78↑ |
| Auth Service Tests | Test | 68 | 87↑ |
| Block Service Tests | Test | 80 | 87↑ |
| Total / Avg | — | 71.5 | 84.7 |
⚠ = git hygiene issue (node_modules or generated files committed)
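The pagination row above is easiest to see in code. Below is a minimal TypeScript sketch of the cursor-pagination pattern the "Cursor Pagination" task implements; an in-memory array stands in for the real Prisma query, and all names are hypothetical:

```typescript
// Hypothetical sketch of cursor pagination. With a real database this
// would be WHERE id > cursor ORDER BY id LIMIT n, which stays fast with
// an index, unlike OFFSET, which scans and discards skipped rows.
interface Note {
  id: number; // monotonically increasing, serves as the cursor
  title: string;
}

interface Page {
  items: Note[];
  nextCursor: number | null; // client passes this back for the next page
}

function paginate(notes: Note[], limit: number, cursor?: number): Page {
  const sorted = [...notes].sort((a, b) => a.id - b.id);
  const start =
    cursor === undefined ? 0 : sorted.findIndex((n) => n.id > cursor);
  const slice = start === -1 ? [] : sorted.slice(start, start + limit);
  const nextCursor =
    slice.length === limit ? slice[slice.length - 1].id : null;
  return { items: slice, nextCursor };
}
```

The key property: each page filters on `id > cursor` instead of skipping rows, so page depth never affects query cost.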
code-review-graph wrote
130 files
+11,416 lines
Real implementations on every task — but steps 55 & 56 committed 27 and 54 files (generated artifacts).
GrapeRoot wrote
91 files
+8,375 lines
All purposeful changes — no generated artifacts, clean git history across all 10 tasks.
By category: Performance · Security · Testing
Core finding
Proactive context beats reactive querying
code-review-graph builds a networkx knowledge graph from tree-sitter AST parsing and exposes tools like get_review_context and get_impact_radius. It never hallucinated — every task got a real implementation. But it's reactive: Claude must call the right tools in the right order. If it doesn't ask for context on the right files, the implementation is shallow.
GrapeRoot is proactive: the dual-graph pre-loads the most relevant symbols and relationships before Claude starts every turn. Claude arrives already knowing which files matter, what they call, and what's absent from the call graph — without needing to issue a single query.
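As a rough illustration (this is a schematic sketch, not GrapeRoot's actual ranking logic), proactive preloading amounts to scoring graph nodes up front and injecting the top results before the model's first turn, rather than waiting for tool calls:

```typescript
// Schematic only: score call-graph nodes against the task's files and
// preload the top-k. All names and the heuristic are hypothetical.
interface CodeSymbol {
  name: string;
  file: string;
  callees: string[]; // outgoing edges in the call graph
}

function preloadContext(
  graph: CodeSymbol[],
  taskFiles: string[], // files the task description mentions
  k: number,
): CodeSymbol[] {
  const score = (s: CodeSymbol): number => {
    // Direct hits on task files rank highest, then symbols one
    // call-edge away from a task file.
    if (taskFiles.includes(s.file)) return 2;
    const neighbors = graph.filter((g) => s.callees.includes(g.name));
    return neighbors.some((n) => taskFiles.includes(n.file)) ? 1 : 0;
  };
  return [...graph]
    .sort((a, b) => score(b) - score(a))
    .slice(0, k)
    .filter((s) => score(s) > 0);
}
```

The point of the sketch is the timing, not the heuristic: the ranking runs before the model sees the task, so no tool call is needed to surface the relevant neighborhood.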
Step 55 (helmet)
CRG: Correct implementation — but 27 files committed (generated artifacts)
GrapeRoot: 10 purposeful files, identical feature coverage
Step 56 (CSRF)
CRG: Working CSRF stack — but 54 files committed, polluting git history
GrapeRoot: 12 clean files: HMAC tokens + double verification + frontend
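The benchmark doesn't publish GrapeRoot's CSRF code, but the pattern named here (HMAC tokens with double verification) typically looks like the following sketch using Node's built-in crypto; the secret and session wiring are assumptions:

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Generic HMAC double-submit sketch, not the benchmark's actual code.
// The secret source and session-id plumbing are hypothetical.
const SECRET = process.env.CSRF_SECRET ?? "dev-only-secret";

function issueToken(sessionId: string): string {
  const nonce = randomBytes(16).toString("hex");
  const mac = createHmac("sha256", SECRET)
    .update(`${sessionId}:${nonce}`)
    .digest("hex");
  return `${nonce}.${mac}`; // set in a cookie AND echoed in a header
}

function verifyToken(sessionId: string, token: string): boolean {
  const [nonce, mac] = token.split(".");
  if (!nonce || !mac) return false;
  const expected = createHmac("sha256", SECRET)
    .update(`${sessionId}:${nonce}`)
    .digest("hex");
  // Constant-time comparison to avoid timing side channels.
  const a = Buffer.from(mac, "hex");
  const b = Buffer.from(expected, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Binding the token to the session id is what makes the "double verification" meaningful: an attacker who obtains a token for their own session cannot replay it against a victim's.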
Step 53 (Redis)
CRG: Correct cache in 2 files, non-standard key pattern, $0.62
GrapeRoot: 6 files, full pattern-delete invalidation, graceful fallback, $0.98
The git hygiene problem
On steps 55 and 56, code-review-graph committed 27 and 54 files respectively — almost certainly node_modules or generated build artifacts. The implementations were functionally correct, but in a real project, committing dozens of generated files is a blocker: it pollutes the git log, breaks code review, and can corrupt other developers' working trees. For production use, this is the most disqualifying failure.
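A simple guard makes this failure mode cheap to catch. The sketch below is hypothetical (the benchmark harness includes no such hook); a pre-commit hook would feed it the output of `git diff --cached --name-only`:

```typescript
// Hypothetical pre-commit guard logic. The path patterns and the
// 25-file threshold are illustrative, not from the benchmark.
function auditStagedFiles(paths: string[], maxFiles = 25): string[] {
  const problems: string[] = [];
  const generated = /(^|\/)(node_modules|dist|build|coverage)\//;
  for (const p of paths) {
    if (generated.test(p)) problems.push(`generated artifact staged: ${p}`);
  }
  if (paths.length > maxFiles) {
    problems.push(`${paths.length} files staged (limit ${maxFiles})`);
  }
  return problems; // empty array means the commit is clean
}
```

Either check alone would have flagged steps 55 and 56 before they reached history.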
Benchmark run March 2026 · Claude Sonnet 4.6 · 10 tasks on CollabNotes (~188 files, TypeScript) · LLM-as-judge via Claude Haiku 4.5 · isolated project copies from identical starting state (same git commit after 50 prior steps).