Benchmarks

Does it actually work? We ran the numbers.

5 benchmark runs · 80+ prompts · real 92-file production codebase · same model (Claude Sonnet 4.6), same questions — with and without GrapeRoot.

45% cheaper on complex tasks — v3.8.35 challenge benchmark · 10/10 prompts
10/10 cost + quality wins — clean sweep across challenge benchmark
34% avg savings, E2E benchmark — 16/20 cost wins · real-world multi-step prompts

GrapeRoot vs jCodeMunch — Naming Audit

10 naming audit tasks · CollabNotes TypeScript codebase (92 src files) · LLM judge with cache/index files stripped · Claude Sonnet 4.6

66.4/100

GrapeRoot avg quality

65.7/100

jCodeMunch avg quality

5–3 (2 ties)

GR quality wins

35%

GR is cheaper by

Equal quality. 35% cheaper.

Across 10 naming audit tasks the LLM judge gave GrapeRoot 66.4/100 and jCodeMunch 65.7/100 — statistically equivalent. Yet GrapeRoot cost $14.11 vs jCodeMunch's $21.87. jCodeMunch spent 55% more and got the same result, because its mandatory index_folder + get_repo_outline startup sequence burns tokens indexing the entire codebase on every session.
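The two percentages are the same gap measured from different baselines — a quick sketch, using only the dollar figures stated above:

```python
# Dollar figures from this page; both percentages describe the same $7.76 gap.
gr_cost, jcm_cost = 14.11, 21.87

jcm_overhead = (jcm_cost - gr_cost) / gr_cost   # what JCM spent on top of GR
gr_savings   = (jcm_cost - gr_cost) / jcm_cost  # GR's discount relative to JCM

print(f"JCM spent {jcm_overhead:.0%} more")  # → JCM spent 55% more
print(f"GR is {gr_savings:.0%} cheaper")     # → GR is 35% cheaper
```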

Charts

Summary — Wins & Total Cost: GR wins 5/10 on quality at 35% lower cost
Cost per Prompt: per-task dollar spend — GR vs JCM
LLM Quality per Prompt: cache-filtered judge score (0–100) per task
Cost vs Quality Scatter: lines connect same prompt — GR vs JCM
Quality Dimensions Radar: 5-axis score breakdown as % of max
Dimension Breakdown: grouped bars, each dimension as % of max score
Cost Efficiency (Score per $): LLM score per dollar — GR is 1.8× more efficient
Quality by Scope: average LLM score per task category
Wall Time per Prompt: minutes per task — GR is 15% faster on average

Prompt-by-prompt scorecard

| Task | Cat. | GR | JCM |
|---|---|---|---|
| Codebase Audit | Code | 80 | 74 |
| Middleware Names | Midd | 75 | 75 |
| DTO Naming | DTO | 69 | 48 |
| Boolean Names | Code | 80 | 79 |
| Generic Names | Code | 76 | 70 |
| Service Methods | Serv | 74 | 80 |
| Auth Module | Auth | 70 | 68 |
| Workspace Module | Work | 55 | 55 |
| Types & Interfaces | Type | 25 | 40 |
| Orphan Detection | Code | 60 | 68 |
| Avg / Total | | 66.4 | 65.7 |
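The headline averages and the 5–3 (2 ties) tally fall straight out of the scorecard — a sketch recomputing them from the per-task scores above (same figures, no new data):

```python
# Per-task judge scores from the scorecard: (task, GR score, JCM score).
scorecard = [
    ("Codebase Audit", 80, 74), ("Middleware Names", 75, 75),
    ("DTO Naming", 69, 48), ("Boolean Names", 80, 79),
    ("Generic Names", 76, 70), ("Service Methods", 74, 80),
    ("Auth Module", 70, 68), ("Workspace Module", 55, 55),
    ("Types & Interfaces", 25, 40), ("Orphan Detection", 60, 68),
]

gr_avg  = sum(gr for _, gr, _ in scorecard) / len(scorecard)   # 66.4
jcm_avg = sum(jc for _, _, jc in scorecard) / len(scorecard)   # 65.7
gr_wins = sum(gr > jc for _, gr, jc in scorecard)              # 5
ties    = sum(gr == jc for _, gr, jc in scorecard)             # 2
```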

Score dimensions (avg, % of max)

| Dimension | Judge question | GR | JCM |
|---|---|---|---|
| File Relevance | Did it change the right files? | 56% | 56% |
| Findings Quality | Real naming issues, well-reasoned? | 71% | 71% |
| Fix Completeness | Renames applied at all call sites? | 68% | 59% |
| Consensus Quality | Evidence of real subagent voting? | 55% | 71% |
| Coverage | Scope fully scanned? | 77% | 76% |
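The dimension percentages can be converted back to rubric points to sanity-check the totals — a sketch using the per-dimension maxima from the rubric in the Methodology section:

```python
# name: (max rubric points, GR avg %, JCM avg %) — figures from the table above.
DIMENSIONS = {
    "file_relevance":    (20, 56, 56),
    "findings_quality":  (25, 71, 71),
    "fix_completeness":  (25, 68, 59),
    "consensus_quality": (15, 55, 71),
    "coverage":          (15, 77, 76),
}

gr_points  = sum(mx * g / 100 for mx, g, _ in DIMENSIONS.values())
jcm_points = sum(mx * j / 100 for mx, _, j in DIMENSIONS.values())
# Both land near the ~66/100 per-task averages; the displayed percentages
# are rounded, which accounts for the small gap.
```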

Where GrapeRoot leads

DTO Naming — GR 69 vs JCM 48 · $0.25 saved
Auth Module — GR 70 vs JCM 68 · $2.27 saved
Codebase (×4 avg) — GR 71.5 vs JCM 67.75 · $0.79 saved

Where jCodeMunch leads

Service Methods — GR 74 vs JCM 80
Types & Interfaces — GR 25 vs JCM 40
Orphan Detection — GR 60 vs JCM 68

JCM's symbol indexing helps on type-scan and orphan detection tasks where full AST coverage matters more than focused graph context.

Core finding

Indexing overhead is the real cost, not the task complexity.

jCodeMunch's mandatory startup sequence — index_folder then get_repo_outline — reads the entire codebase on every session and dumps a full symbol inventory into context. On narrow-scope prompts (middleware, auth, workspace) this wastes 40–80% of the token budget on symbols that will never be touched.
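The 40–80% figure is easiest to see as a ratio of indexed to actually-touched symbols. A purely illustrative sketch — the symbol counts below are hypothetical, not measured benchmark data:

```python
# Illustrative only: a mandatory full-repo index pays for every symbol up
# front, while a narrow-scope prompt (middleware, auth, workspace) touches
# only a handful of them.
def wasted_context_fraction(indexed_symbols: int, touched_symbols: int) -> float:
    """Fraction of indexed symbols the task never reads or edits."""
    return (indexed_symbols - touched_symbols) / indexed_symbols

# e.g. a middleware-only audit in a repo whose index holds 1,500 symbols:
wasted_context_fraction(1500, 120)  # → 0.92
```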

GrapeRoot's dual-graph retrieval front-loads only the files the task actually needs. The judge's Fix Completeness dimension backs this up: GrapeRoot averaged 17.1/25 vs jCodeMunch's 14.8/25, meaning it applied renames more consistently across all call sites despite using fewer tokens.

Methodology

rejudge_naming_audit.py — LLM judge prompt (Python · simplified)
import subprocess

# Cache/build/index files stripped before judging
# (everything in EXCLUDE lives outside src/, so the src/ filter drops it):
EXCLUDE = [".dual-graph/", "dist/", "CLAUDE.md", ".env", "benchmark*", "node_modules/"]
src_files = [f for f in changed_files if f.startswith("src/")]

# 5-dimension rubric (max 100):
# file_relevance    /20 — right files for this scope?
# findings_quality  /25 — naming issues real & justified?
# fix_completeness  /25 — renames applied at all call sites?
# consensus_quality /15 — genuine subagent voting evidence?
# coverage          /15 — scope fully scanned?

# No timeout — each judge call runs until complete
subprocess.run(["claude", "-p", judge_prompt,
    "--model", "claude-sonnet-4-6",
    "--no-session-persistence", "--dangerously-skip-permissions"],
    timeout=None)

Benchmark run March 2026 · Claude Sonnet 4.6 · 10 naming audit tasks on CollabNotes (92 src files, TypeScript/Express/Prisma) · LLM judge: Claude Sonnet 4.6 · 3-way comparison (Normal / GrapeRoot / jCodeMunch) · isolated project copies per variant · judge cost: $1.31 total.