The Coverage Illusion — Luminity Digital
Series 10 · The Measurement Problem · Post 1 of 3 · May 2026
Agentic AI Security

The Coverage Illusion: Why Red-Team Metrics Don’t Measure What You Think

Enterprise agentic AI security reports lead with a coverage number. Two hundred attack vectors tested. Ninety-five percent coverage. The number is real. It is also not what the buyer thinks it is. Coverage measures the breadth of test design against a surface taxonomy. It does not measure depth against the architectural failure modes that actually break systems in production. The gap is structural, not a matter of tooling.

May 2026 · Tom M. Gomez · Luminity Digital · 11 Min Read
Series 10 — The Measurement Problem — opens here. The argument across three posts: the enterprise agentic AI security measurement apparatus produces numbers that look like security signal but are not. The gap is structural — it cannot be closed by better benchmarks or more rigorous audits, because the measurement is happening at the wrong architectural layer. This first post takes on coverage. The next two pick up posture-versus-behavior and the compounding effect. The wedge is uncomfortable: nobody in the enterprise security space is asking the measurement question seriously yet, and the field is racing to ship agentic security tooling without a credible way to know whether any of it works.

The agentic AI security report arrives with a number on the cover. Two hundred attack vectors tested. Ninety-five percent coverage. Pass on prompt injection, pass on jailbreak resistance, pass on tool-call abuse. The number is real. The testing happened. The vendor invoice is paid. The procurement team logs the result. The auditor accepts the attestation. The number is also not what the buyer thinks it is.

Coverage metrics report breadth across an attack-surface taxonomy. They do not report depth against an architectural failure-mode taxonomy. The two are not the same. An agentic system can pass a comprehensive prompt-injection red team and remain trivially exploitable through a tool-call authorization bypass that the red team’s taxonomy did not include — not because the red team was negligent, but because the prevailing attack-surface taxonomies do not map cleanly onto how agentic systems actually fail. This post argues that until red-team coverage is reported against an architectural failure-mode taxonomy, the resulting numbers are uninterpretable as security signal. The post is not an attack on red teamers. Red teaming is a necessary capability. The artifact that fails is coverage-as-currently-reported.

The Coverage Number on the Report

The agentic AI security assessment lands on the desk in a familiar shape. There is a number. There is a list. There is a percentage. The number is large. The list is comprehensive. The percentage rounds up. The CISO scans it, the procurement team logs it, the audit committee receives it. The agent ships into production with the coverage attestation in the file.

The testing was real. Vendors operate against published attack-surface taxonomies — OWASP’s LLM Top 10 and the emerging Agentic Security Initiative, MITRE’s ATLAS matrix, benchmark suites like AgentDojo, the broader Agent Security Bench family. Each provides a vocabulary of attack categories the test plan can be scoped against. Each is honest about its scope. Each produces real measurements when applied. The two hundred attack vectors tested are two hundred attack vectors actually tested.

The problem is not that the testing did not happen. The problem is what the resulting number is taken to mean.

What the Number Actually Measures

Coverage measures breadth across an attack-surface taxonomy. Number of prompt injection variants exercised. Number of jailbreak techniques attempted. Number of tool-call abuse patterns probed. Number of indirect injection vectors traversed through retrieved documents. Each of those numbers is a valid measurement of how much surface the test plan touched.
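As a minimal sketch of the arithmetic behind such a number, assume a hypothetical test plan in which each surface category lists how many vectors the taxonomy enumerates and how many the engagement exercised. The class name, category names, and counts below are illustrative assumptions, not drawn from any vendor report; the counts were chosen only so that the result matches the headline figures used in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceCategory:
    """One attack-surface category in a hypothetical test plan."""
    name: str
    vectors_in_taxonomy: int  # vectors the taxonomy enumerates for this category
    vectors_exercised: int    # vectors the engagement actually ran


def surface_coverage(plan: list[SurfaceCategory]) -> float:
    """Fraction of enumerated vectors that were exercised (breadth only)."""
    total = sum(c.vectors_in_taxonomy for c in plan)
    if total == 0:
        raise ValueError("test plan enumerates no vectors")
    exercised = sum(c.vectors_exercised for c in plan)
    return exercised / total


# Illustrative counts only (assumption), chosen to reproduce the headline figures.
PLAN = [
    SurfaceCategory("prompt_injection_variants", 80, 78),
    SurfaceCategory("jailbreak_techniques", 60, 57),
    SurfaceCategory("tool_call_abuse_patterns", 40, 36),
    SurfaceCategory("indirect_injection_vectors", 30, 29),
]

print(f"{sum(c.vectors_exercised for c in PLAN)} vectors, "
      f"{surface_coverage(PLAN):.0%} coverage")
```

Nothing in the calculation refers to how the system under test is composed. The denominator is whatever the taxonomy enumerates, so the result can only describe how much of that list was touched.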

Surface taxonomies are organized around what is enumerable from outside the system. They name input modalities and attack techniques because those are the things a red team can list, scope, price, and exercise. The Greshake et al. canon on indirect prompt injection established the original surface taxonomy that subsequent work has extended; today’s vendor coverage reports inherit the same shape. The taxonomy is necessary. It surfaces real attack vectors that real adversaries use.

The Two Taxonomies — Stated Precisely

Surface taxonomy. Attack categories organized around enumerable input transformations: what a red team can list, scope, and exercise from outside the system. The basis of most current vendor coverage reporting.

Architectural failure-mode taxonomy. Failure categories organized around where in the system the trust boundary breaks: composition seams, authorization layer placement, trust boundary inversion under retrieval, goal-trajectory drift outside the scripted perimeter. Not currently the basis of vendor coverage reporting.

The two taxonomies are organized around different things, and one does not reduce to the other.

An agentic system can pass comprehensive coverage of the surface taxonomy and remain trivially exploitable through an architectural failure mode that the surface taxonomy did not include as a checkable category. The reverse is also true — a system can fail surface coverage and still resist the architectural attacks that matter for its specific deployment. The two measurements answer different questions. The reporting practice does not distinguish them.
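One way to see that the two measurements answer different questions is to tag the same executed tests along both axes. The sketch below is hypothetical: the `ExecutedTest` type, the identifier spellings, and the sample engagement are assumptions made for illustration, with the failure-mode names taken from the definitions in this post. It reports which categories were touched rather than a score, since the architectural view is not claimed to reduce to a single metric.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for the two taxonomies (assumption).
SURFACE_CATEGORIES = (
    "prompt_injection",
    "jailbreak",
    "tool_call_abuse",
    "indirect_injection",
)
FAILURE_MODES = (
    "composition_seam",
    "authorization_layer_placement",
    "trust_boundary_inversion_under_retrieval",
    "goal_trajectory_drift",
)


@dataclass(frozen=True)
class ExecutedTest:
    surface_category: str
    # None when the test only exercises an input surface and was not
    # designed to probe any architectural assumption.
    failure_mode_probed: Optional[str] = None


def categories_touched(tests):
    """Return (surface categories touched, failure modes probed) as sets."""
    surface = {t.surface_category for t in tests} & set(SURFACE_CATEGORIES)
    modes = {t.failure_mode_probed for t in tests} & set(FAILURE_MODES)
    return surface, modes


# Illustrative engagement: every surface category is touched,
# but only one architectural failure mode is probed.
ENGAGEMENT = [
    ExecutedTest("prompt_injection"),
    ExecutedTest("jailbreak"),
    ExecutedTest("tool_call_abuse"),
    ExecutedTest("indirect_injection", "trust_boundary_inversion_under_retrieval"),
]

surface, modes = categories_touched(ENGAGEMENT)
unprobed = sorted(set(FAILURE_MODES) - modes)
print("surface categories untouched:", sorted(set(SURFACE_CATEGORIES) - surface))
print("failure modes never probed:", unprobed)
```

In this toy engagement the surface list comes back empty while three of the four failure modes were never probed. A report that prints only the first line hides the second.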

Two Taxonomies, One Mismatch

The mismatch is the structural fact this post is about. It deserves direct treatment as a contrast between two reporting paradigms that the field currently conflates.

Surface Coverage Reporting
  • Measures breadth across enumerable attack categories
  • Scoped against published vendor and benchmark taxonomies
  • Reported as a single percentage or vector count
  • Compares cleanly across vendors in procurement
  • Travels well across systems with similar input surfaces
  • Cannot reach architectural failure modes outside the taxonomy

Architectural Failure-Mode Reporting
  • Measures depth against where system assumptions break
  • Scoped against the deployment’s specific composition pattern
  • Reported with explicit assumption coverage and uncertainty disclosure
  • Does not reduce to a single comparable metric
  • Catches composition failures, trust boundary inversions, authorization layer placement
  • Does not yet exist as a published industry standard

The empirical anchor is direct. Triedman et al. (2025), in Multi-Agent Systems Execute Arbitrary Malicious Code, demonstrate that multi-agent systems can be induced to execute arbitrary malicious code through cross-agent execution paths. The attack does not appear in the surface taxonomy as a checkable category. It is a structural composition failure, one that emerges when agents pass intermediate results to one another without treating those results as potentially adversarial. A red team operating against the surface taxonomy does not find this attack class because the taxonomy does not list it. The red team can report ninety-five percent coverage on prompt injection variants while the agent in production remains exploitable through the cross-agent path. The result is not a failure of the red team’s execution. It is a category of failure that fell outside the test plan’s organizing taxonomy.

The structural mismatch

The surface taxonomy organizes around what is enumerable from outside the system. The architectural failure-mode taxonomy organizes around where the system’s assumptions break. The two are organized around different things, and one does not reduce to the other. Coverage against the surface taxonomy reports breadth of test design effort. It does not report depth against architectural failure modes. The gap is not a tooling gap. It is a taxonomic gap that no current vendor product closes.

Why the Gap Persists

The gap is not closed by individual vigilance. It is held open by three structural pressures that operate independently of any one vendor or buyer.

Red-team economics. Engagements are scoped, time-boxed, and priced against an enumerable test plan. The taxonomy that defines the test plan is the taxonomy the engagement reports against. Probing architectural failure modes — composition failures, trust boundary inversions, authorization layer placement — requires deeper system access, longer engagement duration, and a different testing methodology than what surface coverage requires. The economics push toward broad coverage of enumerable categories. They do not fund deep probing of architectural assumptions. A red team that reports against an architectural failure-mode taxonomy charges more, takes longer, and produces a less comparable artifact at the end. Procurement does not reward this.

Vendor incentives. Vendors selling agentic AI security tooling compete on coverage breadth. Two hundred attack vectors beats fifty attack vectors in the head-to-head comparison sheet. No vendor competes by reporting coverage against an architectural failure-mode taxonomy because no architectural failure-mode taxonomy is yet the published industry standard. The OWASP Agentic Security Initiative is moving toward this work — Series 11 will track the institutional progression — but the taxonomy is not yet operationalized in product. Until it is, the breadth metric remains the marketed metric and the procurement metric.

Customer expectations. Enterprise procurement is organized around comparable metrics. A coverage percentage compares cleanly across vendors. An architectural failure-mode assessment does not — it requires interpretation, expert review, and contextual judgment about which failure modes matter for the specific deployment composition. Procurement teams cannot easily reduce that to a comparable score, and the path of least resistance reinforces the metric the field already publishes. The reinforcement is structural, not malicious. It is what comparable procurement looks like when the comparable artifact is the one organized around testability rather than around what fails.

The three pressures interlock. Each individually is a reasonable response to a real constraint. Together they hold the surface-taxonomy metric in place and keep the architectural failure-mode metric out of the procurement conversation. The system optimizes itself toward the metric that is cheap to produce, easy to compare, and interpretable as competitive advantage.

What Credible Coverage Reporting Would Require

Credible coverage reporting for agentic AI security would require three things the current apparatus does not produce.

First, coverage anchored to an architectural failure-mode taxonomy, not a surface taxonomy. The taxonomy needs to exist as a published standard, the test plan needs to be scoped against its categories, and the resulting report needs to name which failure modes were probed and which were not. This is the work the OWASP Agentic Security Initiative is moving toward and that Series 11 will track. It is not yet shipped at production scale.

Second, explicit reporting of which architectural assumptions were probed. A coverage report against an architectural taxonomy still has to be specific to the deployment’s composition. An agent that composes three tools through a single mediator has different failure surfaces than an agent that composes thirty tools through dynamic discovery. The architectural assumptions of the deployment have to be named in the report, and coverage has to be reported against those assumptions. A generic percentage will not survive contact with a specific composition.

Third, uncertainty disclosure where coverage is incomplete because the failure mode is hard to test. Some architectural failure modes — goal-trajectory drift in long-horizon agents, emergent composition failures in multi-agent systems with dynamic agent discovery — are not currently testable at production scale because the necessary infrastructure does not exist. Series 12 argues this in detail for the goal-oriented case: detection of these failures requires stateful trajectory modeling against a formal representation of the authorized objective, and neither exists at production scale today. A credible coverage report says so. It does not round the gap to a percentage.
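The three requirements can be sketched as a report structure. This is a hypothetical shape, not an existing standard or vendor format: the type names, status labels, and sample entries are all assumptions. The sketch names the deployment assumptions, lists each failure mode with an explicit status, refuses to record a gap without an explanation, and deliberately computes no percentage.

```python
from dataclasses import dataclass
from enum import Enum


class ProbeStatus(Enum):
    PROBED = "probed"
    NOT_PROBED = "not probed"
    NOT_TESTABLE = "not currently testable"


@dataclass(frozen=True)
class Finding:
    failure_mode: str
    status: ProbeStatus
    note: str = ""

    def __post_init__(self):
        # Uncertainty disclosure: any gap must say why it is a gap.
        if self.status is not ProbeStatus.PROBED and not self.note.strip():
            raise ValueError(f"{self.failure_mode}: gap reported without explanation")


@dataclass(frozen=True)
class CoverageReport:
    deployment_assumptions: tuple  # composition facts the test plan relied on
    findings: tuple                # one Finding per failure mode in the taxonomy

    def render(self) -> str:
        lines = ["Deployment assumptions:"]
        lines += [f"  - {a}" for a in self.deployment_assumptions]
        lines.append("Failure modes:")
        for f in self.findings:
            suffix = f" ({f.note})" if f.note else ""
            lines.append(f"  - {f.failure_mode}: {f.status.value}{suffix}")
        return "\n".join(lines)


# Illustrative report (assumption): entries echo examples from this post.
REPORT = CoverageReport(
    deployment_assumptions=("three tools composed through a single mediator",),
    findings=(
        Finding("composition_seam", ProbeStatus.PROBED),
        Finding("authorization_layer_placement", ProbeStatus.NOT_PROBED,
                "outside engagement scope"),
        Finding("goal_trajectory_drift", ProbeStatus.NOT_TESTABLE,
                "no stateful trajectory modeling available"),
    ),
)

print(REPORT.render())
```

The design choice worth noting is the absence of an aggregate. A reader has to look at each failure mode and its status, which is less comparable across vendors and is the point of the third requirement.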

The Open Problem, Named Precisely

None of the three properties is currently produced by any vendor at production scale. The Coverage Illusion is not solved by better red teaming, better tooling, or more rigorous procurement. It is solved by changing what coverage reports against — anchoring the metric to architectural failure modes, naming the deployment-specific assumptions probed, and disclosing uncertainty where coverage is incomplete. That change has not happened, and no current vendor product makes it.

This is the first of three cuts. The second turns to the gap between what assessments measure and how the system actually behaves at runtime. The third names what the two failures together leave on the floor. The Coverage Illusion is the entry point because coverage is the metric enterprises currently rely on most, and the one that gives the most false confidence. Naming it precisely is where the argument starts.

The Series Opening

The enterprise agentic AI security measurement apparatus produces numbers that look like security signal but are not. Coverage measures breadth across a surface taxonomy. It does not measure depth against the architectural failure modes that break systems in production. The gap is structural, held open by red-team economics, vendor incentives, and procurement comparability. It is not closed by better testing. It is closed by changing what coverage reports against. Series 10 opens here. Two more cuts follow.
