Show HN: Agent Red Team – Adversarial testing for AI agents before production

(agentredteam.ai)

3 points | by LukataSolutions 2 hours ago ago

3 comments

Testing actions instead of text is the right move, but coverage will lag as real world edge cases evolve.

Agreed! Static coverage will always lag behind real-world edge cases. The plan is to expand attack packs continuously and eventually support runtime testing alongside static analysis. Right now the goal is covering the baseline adversarial cases that most teams ship without testing at all.

LukataSolutions an hour ago

Founder disclosure: I built Agent Red Team.

Most AI agents get tested on whether they work. Almost none get tested on whether they can be tricked, manipulated, or hijacked. If you are deploying agents that take actions like sending emails, writing code, or making purchases, this is the gap.

Problem it solves

Most agent evals test whether the agent does the right thing. Almost none test whether the agent resists the wrong thing. Prompt injection, approval bypass, tool misuse, and memory poisoning usually do not show up in happy path evals, and most teams ship without testing for them.

How it works

You submit an agent's system prompt, tool schemas, or full configuration. The pipeline runs in 5 stages:

Normalization: parses and classifies the artifact, whether that is a system prompt, tool config, or multi-agent setup

Threat modeling: maps the exposed attack surface against 12 threat categories aligned with OWASP, NIST, and MITRE frameworks

Attack simulation: runs curated attack packs, 40 cases across 10 packs, covering prompt injection, goal hijacking, tool parameter abuse, memory poisoning, identity escalation, approval bypass, data exfiltration, unsafe action chaining, supply chain trust, ambiguity exploitation, resource abuse, and cross-agent delegation attacks

Analysis: an LLM evaluates each attack case against the artifact, but with zero authority. No tools, no network, no code execution, no writes. It gets the content as read-only evidence and returns structured JSON only

Validation: 23 deterministic code rules, not an LLM, gate the output. These enforce evidence to claim linkage, exploit chain completeness, severity to evidence alignment, mitigation specificity, and they reject vague filler. If the report fails validation, the pipeline retries with feedback. If it exhausts retries, the scan is refunded automatically

The output is a structured report with concrete exploit paths, including entry point, manipulation step, broken control, and resulting action, not just vague language like "this might be vulnerable."

Key design decisions

User content is treated as data, never authority. The analysis model cannot be instructed by the content it is analyzing. This is the core prompt injection resistance property.

The validator is deterministic code, not LLM judgment. If a finding cannot cite evidence from the actual artifact, it gets rejected.

The system fails closed on ambiguity. If it is not sure, it rejects rather than guesses.

Limitations

Coverage is strongest in prompt injection, tool misuse, and approval bypass. Categories like cross-agent delegation and supply chain trust still need deeper attack packs.

The system analyzes static configurations. It does not execute your agent at runtime, so it cannot catch bugs that only show up in multi-turn conversations or under specific state conditions.

False negatives are possible, especially for novel attack patterns not covered by existing packs. False positives are suppressed by the validator, but not eliminated.

The threat model assumes the attacker interacts through the agent's normal input surfaces. It does not model infrastructure-level attacks like network or container escape.

What I want feedback on

Which attack classes feel most under-tested in your agent systems today? What would make a tool like this more credible or rigorous to you?

Especially interested in hearing from people building multi-agent systems or agents with write access to external systems.

Demo: https://agentredteam.ai

Disclaimer:

Analysis takes up to 7 minutes, then your report will be generated. One free scan per day, resets at midnight UTC.