Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 278-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks · v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins · clean sweep across challenge benchmark
34% avg savings, E2E benchmark · 16/20 cost wins · real-world multi-step prompts

GrapeRoot vs code-review-graph vs jCodeMunch

10 production tasks · one codebase · three graph-based AI tools · proactive vs reactive context

📝 ColabNotes

github.com/kunal12203/collabnotes-benchmark

Notion-like collaborative notes app · Node.js + Express + TypeScript · Prisma + PostgreSQL + Redis · React frontend · ~197 files · 60 benchmark steps applied

84.7/100 · GrapeRoot avg
71.5/100 · code-review-graph avg
34.5/100 · jCodeMunch avg
8–0 vs CRG · GrapeRoot wins

Step-by-step scorecard

| Task | Category | CRG | GR |
| --- | --- | --- | --- |
| Fix N+1 Queries | Perf | 70 | 80 |
| Add DB Indexes | Perf | 72 | 78 |
| Redis Cache | Perf | 76 | 88 |
| Cursor Pagination | Perf | 72 | 86 |
| Helmet + CSP | Security | 78 ⚠ | 85 |
| CSRF Protection | Security | 62 ⚠ | 88 |
| XSS Sanitization | Security | 72 | 90 |
| SQL Injection Audit | Security | 65 | 78 |
| Auth Service Tests | Testing | 68 | 87 |
| Block Service Tests | Testing | 80 | 87 |
| **Total / Avg** | | **71.5** | **84.7** |

⚠ = git hygiene issue (node_modules or generated files committed)
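To make the first task in the scorecard concrete, here is a hedged sketch of the N+1 shape being fixed: one query per note's author versus a single batched lookup. The in-memory "database" and query counter are illustrative only; in the actual Prisma codebase the fix would be a single query with a relation include.

```typescript
// Illustrative N+1 pattern vs batched fix. Not the benchmark's actual code.
type Author = { id: number; name: string };
type Note = { id: number; authorId: number };

const authorTable = new Map<number, Author>([
  [1, { id: 1, name: "Ada" }],
  [2, { id: 2, name: "Grace" }],
]);
const notes: Note[] = [
  { id: 10, authorId: 1 },
  { id: 11, authorId: 2 },
  { id: 12, authorId: 1 },
];

let queryCount = 0;

// N+1: each .get() stands in for a separate SELECT per note.
function authorsNPlusOne(ns: Note[]): Author[] {
  return ns.map((n) => {
    queryCount++;
    return authorTable.get(n.authorId)!;
  });
}

// Fixed: one batched lookup (SELECT ... WHERE id IN (...)) for all authors.
function authorsBatched(ns: Note[]): Author[] {
  queryCount++;
  const ids = [...new Set(ns.map((n) => n.authorId))];
  const byId = new Map(ids.map((id) => [id, authorTable.get(id)!]));
  return ns.map((n) => byId.get(n.authorId)!);
}
```

For the three notes above, the N+1 path issues three "queries" while the batched path issues one; on a list endpoint with hundreds of rows the difference dominates response time.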

code-review-graph wrote 130 files · +11,416 lines

Real implementations on every task — but steps 55 & 56 committed 27 and 54 files (generated artifacts).

GrapeRoot wrote 91 files · +8,375 lines

All purposeful changes — no generated artifacts, clean git history across all 10 tasks.

By category

Performance

CRG avg: 72.5/100
GrapeRoot avg: 83/100
CRG cost: $2.75
GR cost: $2.14

Security

CRG avg: 69.25/100
GrapeRoot avg: 85.25/100
CRG cost: $2.56
GR cost: $2.85

Testing

CRG avg: 74/100
GrapeRoot avg: 87/100
CRG cost: $1.65
GR cost: $1.80

Core finding

Proactive context beats reactive querying

code-review-graph builds a networkx knowledge graph from tree-sitter AST parsing and exposes tools like get_review_context and get_impact_radius. It never hallucinated — every task got a real implementation. But it's reactive: Claude must call the right tools in the right order. If it doesn't ask for context on the right files, the implementation is shallow.

GrapeRoot is proactive: the dual-graph pre-loads the most relevant symbols and relationships before Claude starts every turn. Claude arrives already knowing which files matter, what they call, and what's absent from the call graph — without needing to issue a single query.
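The contrast can be sketched in a few lines. Everything below (the graph shape, the relevance scoring, the class and method names) is a hypothetical illustration of the two strategies, not either tool's real API:

```typescript
// Illustrative only. Reactive: the model must query context per file.
// Proactive: the system ranks symbols against the task and pre-loads the
// top slice before the model's turn, so no query can be "missed".
type GraphSymbol = { name: string; file: string; callers: string[] };

class CodeGraph {
  constructor(private symbols: GraphSymbol[]) {}

  // Reactive: returns context only for the file the model asked about.
  queryContext(file: string): GraphSymbol[] {
    return this.symbols.filter((s) => s.file === file);
  }

  // Proactive: score every symbol against the task text and pre-load top k.
  preloadFor(task: string, k: number): GraphSymbol[] {
    const words = task.toLowerCase().split(/\s+/);
    const score = (s: GraphSymbol) =>
      (words.some((w) => s.name.toLowerCase().includes(w)) ? 10 : 0) +
      s.callers.length; // well-connected symbols are likelier to matter
    return [...this.symbols].sort((a, b) => score(b) - score(a)).slice(0, k);
  }
}
```

With the reactive shape, skipping one `queryContext` call on the right file means the context never arrives; with the proactive shape, that failure mode cannot occur for anything inside the pre-loaded slice.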

Step 55 (helmet)

CRG: Correct implementation — but 27 files committed (generated artifacts)

GrapeRoot: 10 purposeful files, identical feature coverage

Step 56 (CSRF)

CRG: Working CSRF stack — but 54 files committed, polluting git history

GrapeRoot: 12 clean files: HMAC tokens + double verification + frontend
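A minimal sketch of the HMAC double-submit pattern described above, assuming a per-session token carried in both a cookie and a request header. The secret handling, token layout, and function names are assumptions for illustration, not GrapeRoot's actual output:

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Assumed layout: token = "<nonce>.<hmac(sessionId.nonce)>". In real code
// the secret comes from configuration, never a literal.
const SECRET = "replace-with-server-secret";

function issueToken(sessionId: string): string {
  const nonce = randomBytes(16).toString("hex");
  const sig = createHmac("sha256", SECRET)
    .update(`${sessionId}.${nonce}`)
    .digest("hex");
  return `${nonce}.${sig}`;
}

function verifyToken(sessionId: string, token: string): boolean {
  const [nonce, sig] = token.split(".");
  if (!nonce || !sig) return false;
  const expected = createHmac("sha256", SECRET)
    .update(`${sessionId}.${nonce}`)
    .digest("hex");
  const a = Buffer.from(sig, "hex");
  const b = Buffer.from(expected, "hex");
  // Constant-time compare; lengths must match or timingSafeEqual throws.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Double verification: cookie and header tokens must match each other
// AND the shared token must verify against the session's HMAC.
function checkRequest(sessionId: string, cookieToken: string, headerToken: string): boolean {
  return cookieToken === headerToken && verifyToken(sessionId, cookieToken);
}
```

Because the HMAC binds the token to the session, an attacker who can set a cookie cross-site still cannot mint a token that verifies.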

Step 53 (Redis)

CRG: Correct cache in 2 files, non-standard key pattern, $0.62

GrapeRoot: 6 files, full pattern-delete invalidation, graceful fallback, $0.98

The git hygiene problem

On steps 55 and 56, code-review-graph committed 27 and 54 files respectively, almost certainly node_modules or generated build artifacts. The implementations were functionally correct, but in a real project committing dozens of generated files is a blocker: it pollutes git log, breaks code review, and can corrupt other developers' working trees. For production use, this is the most disqualifying failure.

Benchmark run March 2026 · Claude Sonnet 4.6 · 10 tasks on ColabNotes (~188 files, TypeScript) · LLM-as-judge via Claude Haiku 4.5 · isolated project copies from identical starting state (same git commit after 50 prior steps).