Interesting article. It will be great to see similar approaches appear, since ConMon is becoming extremely difficult to manage at scale and existing tools don't really cover what we need.
> Upon issue creation another workflow spins up three independent coding agents to analyze the finding.
I'm curious
1) what the current statistics are for consensus
2) how the agents may/may not perform independently
3) what the agent profiles are and how they differ (model, harness, prompt/persona, all three?)
1. I don't have hard metrics at hand, but with the latest Sonnet I'd say we reach consensus around 80% of the time. With Opus it's almost always, but we are not using it due to cost.
2. When the agents don't reach consensus, the difference I see in their behavior is usually either
- one of them didn't explore enough and lacks context
- and/or their risk assessment is off
The latter happens often. In other agent-based workflows we now give clear instructions on how to assess risk and where to draw the line for considering something a true positive.
3. Validation is on Sonnet. We don't use persona-based prompts; all three validators get the same task and context. The orchestrating agent takes their output and makes the final decision. We use an internal fork of the Claude Code GitHub Action for now.
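For readers trying to picture the consensus step, here is a rough sketch of how three validator outputs could be tallied. It is not the actual implementation: the `Verdict` type, the label names, and the `tally` helper are all hypothetical. In the setup described above an orchestrating agent makes the final call, so a tally like this would at most be one of its inputs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One validator's output (hypothetical shape, for illustration only)."""
    validator: str
    label: str       # assumed labels: "true_positive" or "false_positive"
    rationale: str


def tally(verdicts):
    """Return (majority_label, unanimous) for a non-empty list of verdicts."""
    counts = Counter(v.label for v in verdicts)
    label, votes = counts.most_common(1)[0]
    return label, votes == len(verdicts)


# Three validators that received the same task and context.
verdicts = [
    Verdict("validator-1", "true_positive", "input reaches the sink unsanitized"),
    Verdict("validator-2", "true_positive", "reproduced with a crafted request"),
    Verdict("validator-3", "false_positive", "did not find the call path"),
]

label, unanimous = tally(verdicts)
if unanimous:
    print(f"consensus: {label}")
else:
    # A split is where the orchestrator would need to weigh the rationales,
    # e.g. checking whether the dissenter simply lacked context.
    print(f"no consensus, majority says {label}")
```

With three validators and two labels there is always a majority, so the interesting case is the 2-1 split, which matches the "didn't explore enough" and "risk assessment is off" failure modes mentioned above.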