The answer, my friendow, is blowing in the context window [1]…
_______
…LLMs already "know" these books deeply, but without a structured prompt scaffold they apply that knowledge inconsistently and at low confidence. Giving the model an explicit lens — "review this as if you're checking against Clean Code heuristics C1–C36" — concentrates attention and dramatically reduces hallucinated or off-topic feedback…
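A minimal sketch of that "explicit lens" idea: pin the model to one named rubric and demand citations per finding. The lens wording and heuristic IDs below are illustrative, not taken from any real skill file:

```python
# Wrap the code under review in a prompt that forces one rubric "lens".
# The rubric text and IDs here are made up for illustration.

CLEAN_CODE_LENS = (
    "Review the code below strictly against Clean Code heuristics. "
    "For every finding, cite the heuristic ID (e.g. C3) and quote the "
    "offending line. If no heuristic applies, say so instead of guessing."
)

def build_review_prompt(code: str, lens: str = CLEAN_CODE_LENS) -> str:
    """Combine a rubric 'lens' with the code so feedback stays on-topic."""
    return f"{lens}\n\n```\n{code}\n```"

prompt = build_review_prompt("def f(x): return x + 1")
```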
…
Where I'd push back or warn you:
Context collapse is your #1 enemy. Clean Code was written for Java in 2008. DDIA is about distributed systems at scale. If you apply the Clean Code reviewer to a 50-line Python script, you'll get pedantic nonsense about function length when the actual problem might be that the data model is wrong. Your skill selection logic needs to be domain-aware, not just "throw all skills at every file"…
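One cheap way to make that selection domain-aware: gate each rubric on signals from the file itself rather than running every skill everywhere. The thresholds and skill names below are invented for the sketch:

```python
# Gate rubrics on cheap per-file signals instead of "all skills on every file".
# Thresholds and skill names are illustrative assumptions, not a real config.

def select_skills(path: str, loc: int, touches_storage: bool) -> list[str]:
    skills = ["lint"]  # always-on fast pass
    if loc >= 200 or path.endswith((".java", ".kt")):
        skills.append("clean-code")  # long/OO code: naming, function size
    if touches_storage:
        skills.append("ddia")        # schemas, replication, consistency
    return skills

select_skills("script.py", loc=50, touches_storage=False)
# → ['lint'] — the 50-line Python script gets no function-length pedantry
select_skills("orders/Repo.java", loc=400, touches_storage=True)
# → ['lint', 'clean-code', 'ddia']
```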
Cool idea — “book as rubric” is a nice way to avoid vague self-critiques.
One thing I’ve found helps keep agent review loops from getting shallow is to separate levels of critique:
1) a fast “lint” pass (format, obvious bugs, missing tests)
2) a domain pass that’s forced to cite specific passages/rules from the rubric
3) a “counterexample” pass: reviewer must propose at least 1 concrete failing scenario + how to reproduce
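The three passes above can be sketched as a tiny pipeline — each pass is just a callable returning findings, and the pass functions shown are trivial stand-ins for whatever model call you actually use, not a real API:

```python
# Minimal three-pass review loop: lint, domain (must cite the rubric),
# counterexample (must propose a failing case). Pass bodies are stand-ins.

from typing import Callable

Finding = dict  # e.g. {"msg": ..., "evidence": ..., "repro": ...}

def run_review(code: str,
               passes: list[tuple[str, Callable[[str], list[Finding]]]]) -> list[Finding]:
    findings: list[Finding] = []
    for level, run in passes:
        for f in run(code):
            f["level"] = level  # tag each finding with the pass that produced it
            findings.append(f)
    return findings

passes = [
    ("lint", lambda c: [{"msg": "missing tests"}] if "test" not in c else []),
    ("domain", lambda c: [{"msg": "violates C2", "evidence": "C2: ..."}]),
    ("counterexample", lambda c: [{"msg": "fails on empty input", "repro": "f('')"}]),
]
results = run_review("def f(x): return x", passes)
```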
Question: are you capturing the reviewer’s evidence (links, excerpts, failing cases) in a structured log so a human can audit why the agent changed something?
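For what it's worth, one shape such an evidence log could take — one JSON object per review event, so a human can grep the "why" behind each change. Field names are my guess at what's useful, not a known schema:

```python
# A guessed-at audit-log record: rule cited, excerpt quoted, counterexample,
# and the action the agent took. Append one JSON line per event.

import dataclasses
import json
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    file: str
    rule: str            # e.g. "Clean Code C3"
    excerpt: str         # passage the reviewer cited as evidence
    failing_case: str    # counterexample, if the pass produced one
    action: str          # what the agent changed in response

    def to_json(self) -> str:
        return json.dumps(dataclasses.asdict(self))

event = ReviewEvent(
    file="orders/repo.py",
    rule="Clean Code C3",
    excerpt="functions should do one thing",
    failing_case="save() also sends email",
    action="split save() into save() and notify()",
)
line = event.to_json()  # append to the audit log, one object per line
```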
Related: I’m working on SkillForge (https://skillforge.expert) — we’ve been thinking about similar auditability, but for UI workflows: record once, then replay with checkpoints + retries, so humans approve at meaningful boundaries instead of every click.
_______
[1] https://g2ww.short.gy/ZLStasQ1