TL;DR: AI code reviewer went from "confidently wrong" to actually useful. Fix: stopped pre-selecting context, gave the model tools to fetch evidence itself. Now it either cites file:line or stays quiet.
The Problem
Our AI reviewer flagged a "blocker." Cited the diff, built a plausible argument, suggested a fix.
The senior engineer spent 20 minutes disproving it. The guard clause it claimed was missing? It existed, two files away. The model never had that file, so it guessed and sounded certain.
Pre-selecting context doesn't work. Code review follows evidence chains, and chains aren't predictable.
The Fix
Agentic loop:
Model: "This calls validate()"
→ search_code("validate")
Model: "Two call sites use withRetry(). Third doesn't."
→ get_file_content("config/defaults.go")
Model: "Missing timeout. Bug found."
→ submit_code_review(structured_output)
Model fetches what it needs. Loop ends when it submits structured findings (path, line, severity, evidence—not prose).
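Roughly, in code. A minimal sketch of that loop in Go; callModel, runTool, and the message shapes are hypothetical stand-ins for your LLM client and tool layer, not a real API.

```go
// Minimal sketch of the review loop. Only the control flow mirrors
// the trace above; all names here are assumptions.
package review

import (
	"errors"
	"fmt"
)

// Turn is what the model returns each iteration: a tool call,
// or findings when it invokes the terminal submit action.
type Turn struct {
	ToolName string // "search_code", "get_file_content", or "" on submit
	ToolArgs string
	Findings []Finding // non-nil only on submit_code_review
}

type Finding struct {
	Path     string
	Line     int
	Severity string
	Evidence string // verbatim text the model actually fetched
}

const maxIterations = 12 // guardrail: hard cap on tool turns

func Review(diff string,
	callModel func(ctx []string) (Turn, error),
	runTool func(name, args string) (string, error)) ([]Finding, error) {

	ctx := []string{diff} // the diff stays in context on every turn
	for i := 0; i < maxIterations; i++ {
		turn, err := callModel(ctx)
		if err != nil {
			return nil, err
		}
		if turn.Findings != nil { // terminal action reached
			return turn.Findings, nil
		}
		result, err := runTool(turn.ToolName, turn.ToolArgs)
		if err != nil {
			result = fmt.Sprintf("tool error: %v", err) // let the model recover
		}
		ctx = append(ctx, truncate(result, 4000)) // aggressive context shrinking
	}
	return nil, errors.New("iteration cap hit without a submission")
}

func truncate(s string, n int) string {
	if len(s) > n {
		return s[:n] + "\n[truncated]"
	}
	return s
}
```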
What Changed
- Before: "This might break retries."
- After: "In foo/bar.go:123, call bypasses withRetry(). Other call sites use it (see search results). Wrap or document."
The Pieces
1. Tools — boring, fast, deterministic. get_file_content, search_code. Treat them like production APIs.
2. Terminal action — structured JSON submission, not Markdown. No evidence? Can't submit (see the sketch after this list).
3. Loop — model turn → tool turn → repeat. Aggressive context shrinking (old results truncated, diff stays).
4. Guardrails — iteration caps, timeouts, self-critique checklist.
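The terminal action in code: one possible payload shape plus the gate that refuses evidence-free findings. Field names are assumptions, not a spec; the point is that the schema enforces the rule, not the prompt.

```go
// Sketch of the submit_code_review payload and its validation gate.
// Exact fields are assumed for illustration.
package review

import (
	"encoding/json"
	"fmt"
)

type Submission struct {
	Findings []Finding `json:"findings"`
}

type Finding struct {
	Path     string `json:"path"`
	Line     int    `json:"line"`
	Severity string `json:"severity"` // e.g. "blocker", "warning", "nit"
	Evidence string `json:"evidence"` // verbatim text from a tool result
}

// Validate rejects prose-only findings before they reach a human.
func (s Submission) Validate() error {
	for i, f := range s.Findings {
		if f.Path == "" || f.Line <= 0 {
			return fmt.Errorf("finding %d: missing file:line", i)
		}
		if f.Evidence == "" {
			return fmt.Errorf("finding %d: no evidence, can't submit", i)
		}
	}
	return nil
}

func Parse(raw []byte) (Submission, error) {
	var s Submission
	if err := json.Unmarshal(raw, &s); err != nil {
		return s, fmt.Errorf("not valid JSON: %w", err)
	}
	return s, s.Validate()
}
```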
Evaluation
Pick 5-10 PRs where you know the real risks. Check:
- Found the issue?
- Cited exact file:line?
- Hallucinated anything?
- Fetched evidence when uncertain?
Iterate on tools, not prompts.
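One way to make that checklist mechanical: a small Go test over your known PRs. knownPRs and runReviewer are stubs for your fixtures and the loop sketched earlier; the hallucination check here is one assumed heuristic (a finding may only cite files the loop actually fetched).

```go
// Sketch of an eval harness over PRs with known issues.
// Fixture format and reviewer hookup are assumptions.
package review

import "testing"

type Finding struct {
	Path     string
	Line     int
	Evidence string
}

type Case struct {
	Name     string
	Diff     string
	WantPath string // where the real issue lives, e.g. "foo/bar.go"
	WantLine int
}

// Stubs to fill in: your 5-10 curated PRs, and the loop from earlier,
// extended to report which files it fetched.
func knownPRs() []Case { return nil }

func runReviewer(t *testing.T, diff string) ([]Finding, map[string]bool) {
	t.Helper()
	return nil, map[string]bool{}
}

func TestReviewer(t *testing.T) {
	for _, c := range knownPRs() {
		t.Run(c.Name, func(t *testing.T) {
			findings, fetched := runReviewer(t, c.Diff)
			hit := false
			for _, f := range findings {
				if f.Path == c.WantPath && f.Line == c.WantLine {
					hit = true // found the issue at the exact file:line
				}
				if f.Evidence == "" {
					t.Errorf("%s:%d cited no evidence", f.Path, f.Line)
				}
				if !fetched[f.Path] {
					t.Errorf("cited %s without fetching it", f.Path) // hallucination check
				}
			}
			if !hit {
				t.Errorf("missed the known issue at %s:%d", c.WantPath, c.WantLine)
			}
		})
	}
}
```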
The Pattern
Don't build bigger prompts. Build a loop where the model can fetch evidence, test hypotheses, and submit only when it can cite sources. That's the difference between "sounds right" and "is right."