The Coverage Illusion: Why Red-Team Metrics Don't Measure What You Think

Series 10, The Measurement Problem, opens here. The argument across three posts: the enterprise agentic AI security measurement apparatus produces numbers that look like security signal but are not. The gap is structural. It cannot be closed by better benchmarks or more rigorous audits, because the measurement is happening at the wrong architectural layer. This first post takes on coverage. The next two pick up posture versus behavior and the compounding effect. The premise is uncomfortable: nobody in the enterprise security space is asking the measurement question seriously yet, and the field is racing to ship agentic security tooling without a credible way to know whether any of it works.

The agentic AI security report arrives with a number on the cover. Two hundred attack vectors tested. Ninety-five percent coverage. Pass on prompt injection, pass on jailbreak resistance, pass on tool-call abuse. The number is real. The testing happened. The vendor invoice is paid. The procurement team logs the result. The auditor accepts the attestation. The number is also not what the buyer thinks it is.

Coverage metrics report breadth across an attack-surface taxonomy. They do not report depth against an architectural failure-mode taxonomy. The two are not the same. An agentic system can pass a comprehensive prompt-injection red team and remain trivially exploitable through a tool-call authorization bypass that the red team’s taxonomy did not include — not because the red team was negligent, but because the prevailing attack-surface taxonomies do not map cleanly onto how agentic systems actually fail. This post argues that until red-team coverage is reported against an architectural failure-mode taxonomy, the resulting numbers are uninterpretable as security signal. The post is not an attack on red teamers. Red teaming is a necessary capability. The artifact that fails is coverage-as-currently-reported.

The Coverage Number on the Report

The agentic AI security assessment lands on the desk in a familiar shape. There is a number. There is a list. There is a percentage. The number is large. The list is comprehensive. The percentage rounds up. The CISO scans it, the procurement team logs it, the audit committee receives it. The agent ships into production with the coverage attestation in the file.

The testing was real. Vendors operate against published attack-surface taxonomies — OWASP’s LLM Top 10 and the emerging Agentic Security Initiative, MITRE’s ATLAS matrix, benchmark suites like AgentDojo, the broader Agent Security Bench family. Each provides a vocabulary of attack categories the test plan can be scoped against. Each is honest about its scope. Each produces real measurements when applied. The two hundred attack vectors tested are two hundred attack vectors actually tested.

The problem is not that the testing did not happen. The problem is what the resulting number is taken to mean.

What the Number Actually Measures

Coverage measures breadth across an attack-surface taxonomy. Number of prompt injection variants exercised. Number of jailbreak techniques attempted. Number of tool-call abuse patterns probed. Number of indirect injection vectors traversed through retrieved documents. Each of those numbers is a valid measurement of how much surface the test plan touched.

Surface taxonomies are organized around what is enumerable from outside the system. They name input modalities and attack techniques because those are the things a red team can list, scope, price, and exercise. The Greshake et al. canon on indirect prompt injection established the original surface taxonomy that subsequent work has extended; today’s vendor coverage reports inherit the same shape. The taxonomy is necessary. It surfaces real attack vectors that real adversaries use.

The Two Taxonomies — Stated Precisely

Surface taxonomy. Attack categories organized around enumerable input transformations: what a red team can list, scope, and exercise from outside the system. The basis of most current vendor coverage reporting.

Architectural failure-mode taxonomy. Failure categories organized around where in the system the trust boundary breaks: composition seams, authorization layer placement, trust boundary inversion under retrieval, goal-trajectory drift outside the scripted perimeter. Not currently the basis of vendor coverage reporting.

The two taxonomies are organized around different things, and one does not reduce to the other.

An agentic system can pass comprehensive coverage of the surface taxonomy and remain trivially exploitable through an architectural failure mode that the surface taxonomy did not include as a checkable category. The reverse is also true — a system can fail surface coverage and still resist the architectural attacks that matter for its specific deployment. The two measurements answer different questions. The reporting practice does not distinguish them.
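
A minimal sketch in Python may make the distinction concrete. The category names, the counts, and the `coverage` and `unprobed` helpers are hypothetical illustrations, not taken from any vendor report or published taxonomy. The sketch shows how a test plan can score 95 percent against the taxonomy it was scoped to while carrying no data at all on categories that taxonomy never listed.

```python
# Hypothetical test-plan results: (vectors planned, vectors exercised),
# keyed by surface-taxonomy category. All numbers are illustrative.
surface_plan = {
    "prompt_injection": (80, 78),
    "jailbreak": (60, 57),
    "tool_call_abuse": (40, 37),
    "indirect_injection": (20, 18),
}

# Architectural failure modes named in this post. None is a key in surface_plan.
architectural_modes = [
    "composition_seam",
    "authorization_layer_placement",
    "trust_boundary_inversion",
    "goal_trajectory_drift",
]


def coverage(plan):
    """Fraction of planned vectors actually exercised, across all categories."""
    planned = sum(p for p, _ in plan.values())
    exercised = sum(e for _, e in plan.values())
    return exercised / planned


def unprobed(plan, modes):
    """Failure modes the plan cannot report on because it never listed them."""
    return [m for m in modes if m not in plan]


surface_pct = coverage(surface_plan)
blind_spots = unprobed(surface_plan, architectural_modes)
print(f"surface coverage: {surface_pct:.0%}")
print(f"architectural modes with no data: {len(blind_spots)} of {len(architectural_modes)}")
```

The percentage is computed correctly. It simply has no denominator entry for the second list, which is the mismatch in miniature.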

Two Taxonomies, One Mismatch

The mismatch is the structural fact this post is about. It deserves direct treatment as a contrast between two reporting paradigms that the field currently conflates.

Surface Coverage Reporting

Measures breadth across enumerable attack categories
Scoped against published vendor and benchmark taxonomies
Reported as a single percentage or vector count
Compares cleanly across vendors in procurement
Travels well across systems with similar input surfaces
Cannot reach architectural failure modes outside the taxonomy

Architectural Failure-Mode Reporting

Measures depth against where system assumptions break
Scoped against the deployment’s specific composition pattern
Reported with explicit assumption coverage and uncertainty disclosure
Does not reduce to a single comparable metric
Catches composition failures, trust boundary inversions, authorization layer placement
Does not yet exist as a published industry standard

The empirical anchor is direct. Triedman et al. (2025) — Multi-Agent Systems Execute Arbitrary Malicious Code — demonstrates that multi-agent systems can be induced to execute arbitrary malicious code through cross-agent execution paths. The attack does not appear in the surface taxonomy as a checkable category. It is a structural composition failure — emergent from how agents pass intermediate results to one another with insufficient adversarial treatment of the intermediate. A red team operating against the surface taxonomy does not find this attack class because the taxonomy does not list it. The red team can report ninety-five percent coverage on prompt injection variants and the agent in production remains exploitable through the cross-agent path. The result is not a failure of the red team’s execution. It is a category of failure that fell outside the test plan’s organizing taxonomy.
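
The shape of this failure class can be sketched with a toy two-agent handoff. This is an illustration only, not a reproduction of the attack in Triedman et al.; the agent names, the `RUN:` convention, and the filter are all hypothetical, and nothing is executed. It shows a surface-level check passing on the user prompt while attacker-controlled text reaches an executor through a trusted peer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str   # "user" or an agent name
    content: str


def user_prompt_filter(prompt: str) -> bool:
    """Surface-level check: the only place this toy system looks for injection."""
    return "RUN:" not in prompt


def reader_agent(document: str) -> Message:
    """Reads an untrusted document and forwards its notes to a peer agent."""
    # The document text passes through verbatim. Once wrapped in a message
    # from a trusted peer, nothing marks it as untrusted.
    return Message(sender="reader_agent", content=f"Notes: {document}")


def executor_agent(msg: Message, trusted_senders: set) -> list:
    """Returns the commands it would run. Trust is decided by sender alone."""
    if msg.sender not in trusted_senders:
        return []
    return [line[len("RUN:"):].strip()
            for line in msg.content.splitlines()
            if line.startswith("RUN:")]


user_prompt = "Summarize the attached vendor document."
document = "Q3 pricing overview.\nRUN: echo attacker-controlled"

prompt_passed = user_prompt_filter(user_prompt)
handoff = reader_agent(document)
would_run = executor_agent(handoff, trusted_senders={"reader_agent"})

print(prompt_passed, would_run)
```

A test plan scoped to prompt variants exercises `user_prompt_filter` and records a pass. The seam between `reader_agent` and `executor_agent` is never on the list.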

The structural mismatch

The surface taxonomy organizes around what is enumerable from outside the system. The architectural failure-mode taxonomy organizes around where the system’s assumptions break. The two are organized around different things, and one does not reduce to the other. Coverage against the surface taxonomy reports breadth of test design effort. It does not report depth against architectural failure modes. The gap is not a tooling gap. It is a taxonomic gap that no current vendor product closes.

Why the Gap Persists

The gap is not closed by individual vigilance. It is held open by three structural pressures that operate independently of any one vendor or buyer.

Red-team economics. Engagements are scoped, time-boxed, and priced against an enumerable test plan. The taxonomy that defines the test plan is the taxonomy the engagement reports against. Probing architectural failure modes — composition failures, trust boundary inversions, authorization layer placement — requires deeper system access, longer engagement duration, and a different testing methodology than what surface coverage requires. The economics push toward broad coverage of enumerable categories. They do not fund deep probing of architectural assumptions. A red team that reports against an architectural failure-mode taxonomy charges more, takes longer, and produces a less comparable artifact at the end. Procurement does not reward this.

Vendor incentives. Vendors selling agentic AI security tooling compete on coverage breadth. Two hundred attack vectors beats fifty attack vectors in the head-to-head comparison sheet. No vendor competes by reporting coverage against an architectural failure-mode taxonomy because no architectural failure-mode taxonomy is yet the published industry standard. The OWASP Agentic Security Initiative is moving toward this work — Series 11 will track the institutional progression — but the taxonomy is not yet operationalized in product. Until it is, the breadth metric remains the marketed metric and the procurement metric.

Customer expectations. Enterprise procurement is organized around comparable metrics. A coverage percentage compares cleanly across vendors. An architectural failure-mode assessment does not — it requires interpretation, expert review, and contextual judgment about which failure modes matter for the specific deployment composition. Procurement teams cannot easily reduce that to a comparable score, and the path of least resistance reinforces the metric the field already publishes. The reinforcement is structural, not malicious. It is what comparable procurement looks like when the comparable artifact is the one organized around testability rather than around what fails.

The three pressures interlock. Each individually is a reasonable response to a real constraint. Together they hold the surface-taxonomy metric in place and keep the architectural failure-mode metric out of the procurement conversation. The system optimizes itself toward the metric that is cheap to produce, easy to compare, and interpretable as competitive advantage.

What Credible Coverage Reporting Would Require

Credible coverage reporting for agentic AI security would require three things the current apparatus does not produce.

First, coverage anchored to an architectural failure-mode taxonomy, not a surface taxonomy. The taxonomy needs to exist as a published standard, the test plan needs to be scoped against its categories, and the resulting report needs to name which failure modes were probed and which were not. This is the work the OWASP Agentic Security Initiative is moving toward and that Series 11 will track. It is not yet shipped at production scale.

Second, explicit reporting of which architectural assumptions were probed. A coverage report against an architectural taxonomy still has to be specific to the deployment’s composition. An agent that composes three tools through a single mediator has different failure surfaces than an agent that composes thirty tools through dynamic discovery. The architectural assumptions of the deployment have to be named in the report, and coverage has to be reported against those assumptions. A generic percentage will not survive contact with a specific composition.
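
One rough way to see why the composition matters is to count handoff seams. The counting model below is an assumption of this sketch, not a metric the post proposes: it treats each place where one component's output becomes another's input as one seam, and takes the worst case for dynamic discovery.

```python
def mediated_seams(n_tools: int) -> int:
    """Each tool exchanges data only with the mediator: one seam per tool."""
    return n_tools


def discovered_seams(n_tools: int) -> int:
    """Worst case under dynamic discovery: any tool's output may reach any
    other tool, so every ordered pair of tools is a potential handoff."""
    return n_tools * (n_tools - 1)


small = mediated_seams(3)
large = discovered_seams(30)
print(small, large)  # 3 870
```

Under these assumptions the two deployments differ by more than two orders of magnitude in the number of seams a report would have to account for, which is why a single generic percentage cannot describe both.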

Third, uncertainty disclosure where coverage is incomplete because the failure mode is hard to test. Some architectural failure modes — goal-trajectory drift in long-horizon agents, emergent composition failures in multi-agent systems with dynamic agent discovery — are not currently testable at production scale because the necessary infrastructure does not exist. Series 12 argues this in detail for the goal-oriented case: detection of these failures requires stateful trajectory modeling against a formal representation of the authorized objective, and neither exists at production scale today. A credible coverage report says so. It does not round the gap to a percentage.
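
The three requirements above can be sketched as a report schema. Everything here is hypothetical: the `Finding` fields, the `Status` values, and the sample entries are illustrative and do not correspond to any published standard. The design choice worth noting is that `summarize` returns per-status counts and a named list of gaps, and deliberately offers no single percentage.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PROBED = "probed"
    NOT_PROBED = "not probed"
    NOT_TESTABLE = "not currently testable"


@dataclass(frozen=True)
class Finding:
    failure_mode: str   # category from an architectural taxonomy
    assumption: str     # deployment-specific assumption under test
    status: Status
    note: str = ""      # why it was skipped or cannot be tested


def summarize(findings):
    """Counts per status, plus every gap listed by name. No single percentage."""
    counts = Counter(f.status for f in findings)
    gaps = [f for f in findings if f.status is not Status.PROBED]
    return counts, gaps


report = [
    Finding("authorization_layer_placement",
            "tool calls are authorized outside the model", Status.PROBED),
    Finding("trust_boundary_inversion",
            "retrieved documents never gain instruction authority",
            Status.NOT_PROBED, "out of scope for this engagement"),
    Finding("goal_trajectory_drift",
            "long-horizon runs stay within the authorized objective",
            Status.NOT_TESTABLE, "no trajectory model of the objective exists"),
]

counts, gaps = summarize(report)
for f in gaps:
    print(f"{f.failure_mode}: {f.status.value} ({f.note})")
```

A reader of this report sees which assumptions were exercised, which were skipped, and which could not be tested at all, each with a stated reason.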

The Open Problem, Named Precisely

None of the three properties is currently produced by any vendor at production scale. The Coverage Illusion is not solved by better red teaming, better tooling, or more rigorous procurement. It is solved by changing what coverage reports against — anchoring the metric to architectural failure modes, naming the deployment-specific assumptions probed, and disclosing uncertainty where coverage is incomplete. That change has not happened, and no current vendor product makes it.

This is the first of three cuts. The second turns to the gap between what assessments measure and what runtime behavior actually does. The third names what the two failures together leave on the floor. The Coverage Illusion is the entry point because coverage is the metric enterprises currently rely on most, and it is the one that gives the most false confidence. Naming it precisely is where the argument starts.

The Series Opening

The enterprise agentic AI security measurement apparatus produces numbers that look like security signal but are not. Coverage measures breadth across a surface taxonomy. It does not measure depth against the architectural failure modes that break systems in production. The gap is structural, held open by red-team economics, vendor incentives, and procurement comparability. It is not closed by better testing. It is closed by changing what coverage reports against. Series 10 opens here. Two more cuts follow.

Series 10 · The Measurement Problem

Post 01 · The Coverage Illusion (this post)

Post 02 · The Posture Mirage

Post 03 · No Floor Underneath

→ Surface Taxonomy: A vocabulary of attack categories organized around input modalities and attack techniques. Testable from outside the system. The basis of most current vendor coverage reporting and the source of comparable procurement metrics.
→ Architectural Failure-Mode Taxonomy: A vocabulary organized around where in the system the trust boundary breaks. Composition failures, authorization layer placement, trust boundary inversion. Not currently the basis of vendor coverage reporting at production scale.
→ The Coverage Illusion: The structural mismatch between what coverage metrics measure (test design breadth across a surface taxonomy) and what enterprises take them to mean (security posture against deployed-system failure modes).
→ Structural Enforcement: Luminity’s term for security properties that hold by architectural construction rather than by probabilistic detection. The reference point against which credible coverage reporting should ultimately be anchored.

