Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 278-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across challenge benchmark
34% avg savings, E2E benchmark · 16/20 cost wins · real-world multi-step prompts

GrapeRoot vs code-review-graph vs jCodeMunch

10 production tasks · one codebase · three graph-based AI tools · proactive vs reactive context

📝 ColabNotes

github.com/kunal12203/collabnotes-benchmark

Notion-like collaborative notes app · Node.js + Express + TypeScript · Prisma + PostgreSQL + Redis · React frontend · ~197 files · 60 benchmark steps applied

84.7/100 · GrapeRoot avg
71.5/100 · code-review-graph avg
34.5/100 · jCodeMunch avg
8–0 vs CRG · GrapeRoot wins

Step-by-step scorecard

| Task | Category | CRG | GR |
| --- | --- | --- | --- |
| Fix N+1 Queries | Perf | 70 | 80 |
| Add DB Indexes | Perf | 72 | 78 |
| Redis Cache | Perf | 76 | 88 |
| Cursor Pagination | Perf | 72 | 86 |
| Helmet + CSP | Security | 78 ⚠ | 85 |
| CSRF Protection | Security | 62 ⚠ | 88 |
| XSS Sanitization | Security | 72 | 90 |
| SQL Injection Audit | Security | 65 | 78 |
| Auth Service Tests | Testing | 68 | 87 |
| Block Service Tests | Testing | 80 | 87 |
| **Total / Avg** | | **71.5** | **84.7** |

⚠ = git hygiene issue (node_modules or generated files committed)
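To make the first task in the scorecard concrete, here is a hedged sketch of the N+1 shape being fixed: one query per note's author versus a single batched lookup. The in-memory "database" and query counter are illustrative only; in the actual Prisma codebase the fix would be a single query with a relation include.

```typescript
// Illustrative N+1 pattern vs batched fix. Not the benchmark's actual code.
type Author = { id: number; name: string };
type Note = { id: number; authorId: number };

const authorTable = new Map<number, Author>([
  [1, { id: 1, name: "Ada" }],
  [2, { id: 2, name: "Grace" }],
]);
const notes: Note[] = [
  { id: 10, authorId: 1 },
  { id: 11, authorId: 2 },
  { id: 12, authorId: 1 },
];

let queryCount = 0;

// N+1: each .get() stands in for a separate SELECT per note.
function authorsNPlusOne(ns: Note[]): Author[] {
  return ns.map((n) => {
    queryCount++;
    return authorTable.get(n.authorId)!;
  });
}

// Fixed: one batched lookup (SELECT ... WHERE id IN (...)) for all authors.
function authorsBatched(ns: Note[]): Author[] {
  queryCount++;
  const ids = [...new Set(ns.map((n) => n.authorId))];
  const byId = new Map(ids.map((id) => [id, authorTable.get(id)!]));
  return ns.map((n) => byId.get(n.authorId)!);
}
```

For the three notes above, the N+1 path issues three "queries" while the batched path issues one; on a list endpoint with hundreds of rows the difference dominates response time.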

code-review-graph wrote 130 files · +11,416 lines

Real implementations on every task — but steps 55 & 56 committed 27 and 54 files (generated artifacts).

GrapeRoot wrote 91 files · +8,375 lines

All purposeful changes — no generated artifacts, clean git history across all 10 tasks.

By category

Performance

CRG avg: 72.5/100
GrapeRoot avg: 83/100
CRG cost: $2.75
GR cost: $2.14

Security

CRG avg: 69.25/100
GrapeRoot avg: 85.25/100
CRG cost: $2.56
GR cost: $2.85

Testing

CRG avg: 74/100
GrapeRoot avg: 87/100
CRG cost: $1.65
GR cost: $1.80

Core finding

Proactive context beats reactive querying

code-review-graph builds a networkx knowledge graph from tree-sitter AST parsing and exposes tools like get_review_context and get_impact_radius. It never hallucinated — every task got a real implementation. But it's reactive: Claude must call the right tools in the right order. If it doesn't ask for context on the right files, the implementation is shallow.

GrapeRoot is proactive: the dual-graph pre-loads the most relevant symbols and relationships before Claude starts every turn. Claude arrives already knowing which files matter, what they call, and what's absent from the call graph — without needing to issue a single query.
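The contrast can be sketched in a few lines. Everything below (the graph shape, the relevance scoring, the class and method names) is a hypothetical illustration of the two strategies, not either tool's real API:

```typescript
// Illustrative only. Reactive: the model must query context per file.
// Proactive: the system ranks symbols against the task and pre-loads the
// top slice before the model's turn, so no query can be "missed".
type GraphSymbol = { name: string; file: string; callers: string[] };

class CodeGraph {
  constructor(private symbols: GraphSymbol[]) {}

  // Reactive: returns context only for the file the model asked about.
  queryContext(file: string): GraphSymbol[] {
    return this.symbols.filter((s) => s.file === file);
  }

  // Proactive: score every symbol against the task text and pre-load top k.
  preloadFor(task: string, k: number): GraphSymbol[] {
    const words = task.toLowerCase().split(/\s+/);
    const score = (s: GraphSymbol) =>
      (words.some((w) => s.name.toLowerCase().includes(w)) ? 10 : 0) +
      s.callers.length; // well-connected symbols are likelier to matter
    return [...this.symbols].sort((a, b) => score(b) - score(a)).slice(0, k);
  }
}
```

With the reactive shape, skipping one `queryContext` call on the right file means the context never arrives; with the proactive shape, that failure mode cannot occur for anything inside the pre-loaded slice.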

Step 55 (helmet)

CRG: Correct implementation — but 27 files committed (generated artifacts)

GrapeRoot: 10 purposeful files, identical feature coverage

Step 56 (CSRF)

CRG: Working CSRF stack — but 54 files committed, polluting git history

GrapeRoot: 12 clean files: HMAC tokens + double verification + frontend
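A minimal sketch of the HMAC double-submit pattern described above, assuming a per-session token carried in both a cookie and a request header. The secret handling, token layout, and function names are assumptions for illustration, not GrapeRoot's actual output:

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Assumed layout: token = "<nonce>.<hmac(sessionId.nonce)>". In real code
// the secret comes from configuration, never a literal.
const SECRET = "replace-with-server-secret";

function issueToken(sessionId: string): string {
  const nonce = randomBytes(16).toString("hex");
  const sig = createHmac("sha256", SECRET)
    .update(`${sessionId}.${nonce}`)
    .digest("hex");
  return `${nonce}.${sig}`;
}

function verifyToken(sessionId: string, token: string): boolean {
  const [nonce, sig] = token.split(".");
  if (!nonce || !sig) return false;
  const expected = createHmac("sha256", SECRET)
    .update(`${sessionId}.${nonce}`)
    .digest("hex");
  const a = Buffer.from(sig, "hex");
  const b = Buffer.from(expected, "hex");
  // Constant-time compare; lengths must match or timingSafeEqual throws.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Double verification: cookie and header tokens must match each other
// AND the shared token must verify against the session's HMAC.
function checkRequest(sessionId: string, cookieToken: string, headerToken: string): boolean {
  return cookieToken === headerToken && verifyToken(sessionId, cookieToken);
}
```

Because the HMAC binds the token to the session, an attacker who can set a cookie cross-site still cannot mint a token that verifies.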

Step 53 (Redis)

CRG: Correct cache in 2 files, non-standard key pattern, $0.62

GrapeRoot: 6 files, full pattern-delete invalidation, graceful fallback, $0.98

The git hygiene problem

On steps 55 and 56, code-review-graph committed 27 and 54 files respectively, almost certainly node_modules or generated build artifacts. The implementations were functionally correct, but in a real project committing dozens of generated files is a blocker: it pollutes git log, breaks code review, and can corrupt other developers' working trees. For production use, this is the most disqualifying failure.

Benchmark run March 2026 · Claude Sonnet 4.6 · 10 tasks on ColabNotes (~188 files, TypeScript) · LLM-as-judge via Claude Haiku 4.5 · isolated project copies from identical starting state (same git commit after 50 prior steps).