The enterprise security architect carries two artifacts. The coverage report from the red-team engagement. The posture attestation from the audit. Each is the product of real work. Each is signed by parties with legitimate authority to sign. Each compares cleanly across vendors and is interpretable inside the procurement process. Together they constitute what the field treats as the security-posture floor for an agentic AI deployment. The architect stands on the two artifacts the way an engineer stands on certified materials.
Yet the two artifacts do not establish what the architect believes they establish. Post 01 named the first failure: coverage measures breadth across a surface taxonomy organized around enumerable input transformations, not depth against the architectural failure modes that actually occur in production. Post 02 named the second: posture assessment measures declared state at the audit moment, not runtime behavior, and the four mechanisms that drive the assessed-versus-deployed gap are structural. This post takes the next step. The two failures compound in a specific way: even if one is fixed, the other still leaves the architect without a credible measurement of agentic security posture. What is left on the floor, after both failures are accounted for, is the architectural absence that Series 12 named from the detection side. The two findings are the same gap, reached from two angles.
The Two Failures, Restated for Synthesis
Post 01 named the Coverage Illusion. Coverage metrics report breadth across an attack-surface taxonomy — number of prompt injection variants tested, number of jailbreak techniques attempted, number of tool-call abuse patterns probed. The taxonomy is organized around what is enumerable from outside the system. The architectural failure-mode taxonomy is organized around where in the system the trust boundary breaks — composition failures, authorization layer placement, trust boundary inversion under retrieval-augmented context, goal-trajectory drift outside the scripted perimeter. The two are not the same. Triedman et al. (2025) demonstrated the gap empirically: a multi-agent system can pass comprehensive prompt-injection coverage and remain trivially exploitable through a cross-agent execution path the surface taxonomy did not include.
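The difference between the two taxonomies can be made concrete with a minimal sketch. The category names below are taken from the paragraph above; the `coverage` function and the idea of scoring a test plan against each taxonomy separately are illustrative assumptions, not a method from Post 01 or from Triedman et al.

```python
def coverage(tested: set[str], taxonomy: set[str]) -> float:
    """Fraction of a taxonomy's categories that the test plan touched."""
    if not taxonomy:
        raise ValueError("taxonomy must not be empty")
    return len(tested & taxonomy) / len(taxonomy)

# Surface taxonomy: what is enumerable from outside the system.
SURFACE = {"prompt_injection", "jailbreak", "tool_call_abuse"}

# Architectural taxonomy: where the trust boundary breaks.
ARCHITECTURAL = {
    "composition_failure",
    "authorization_layer_placement",
    "trust_boundary_inversion",
    "goal_trajectory_drift",
}

# A hypothetical engagement that probed every surface category.
tested = {"prompt_injection", "jailbreak", "tool_call_abuse"}

surface_score = coverage(tested, SURFACE)
architectural_score = coverage(tested, ARCHITECTURAL)
print(f"surface: {surface_score:.0%}, architectural: {architectural_score:.0%}")
```

The same test plan scores full coverage on one taxonomy and none on the other. The number on the report depends on which taxonomy sits in the denominator.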
Post 02 named the Posture Mirage. Static posture assessment measures declared state at the audit moment — configurations enumerated, controls mapped, model cards reviewed, policy documents attached. Within scope, the methodology produces accurate measurements. Outside scope, four mechanisms make the assessed-versus-deployed gap structural: tool-call resolution drift against discovered surfaces, trajectory unboundedness in goal-oriented agents, policy enforcement at the model layer rather than the protocol layer, and deployment surface unboundedness as MCP servers, retrieval indexes, and downstream agents are added at runtime. Continuous assessment does not close the gap because the gap is not in the cadence. VeriGuard (Miculicich et al., October 2025) is the architectural admission: the runtime monitor exists because offline assessment does not generalize to runtime. CLTR Scheming in the Wild documents the divergence empirically.
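A minimal sketch of the assessed-versus-deployed gap, assuming the attested manifest and the runtime-discovered tool surface can each be reduced to a set of tool names. The tool names and the `surface_drift` helper are hypothetical and stand in for whatever a real deployment exposes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceDrift:
    added: frozenset[str]    # reachable at runtime, absent from the attestation
    removed: frozenset[str]  # attested, no longer present at runtime


def surface_drift(attested: set[str], discovered: set[str]) -> SurfaceDrift:
    """Compare the tool surface named at audit time with the one seen at runtime."""
    return SurfaceDrift(
        added=frozenset(discovered - attested),
        removed=frozenset(attested - discovered),
    )


# Hypothetical manifest reviewed during the audit window.
attested_tools = {"crm.read", "email.send"}

# Hypothetical surface after an MCP server and a downstream agent were added.
runtime_tools = {"crm.read", "email.send", "fs.write", "agent_b.delegate"}

drift = surface_drift(attested_tools, runtime_tools)
print(sorted(drift.added))  # ['agent_b.delegate', 'fs.write']
```

Running the comparison more often shortens how long the drift goes unnoticed. It does not make the attestation describe the deployed system, which is the sense in which the gap is not in the cadence.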
Two failures. Two artifacts. Two attestations the enterprise security architect carries forward into procurement. Each is real. Each is also organized around a different property of the system. The synthesis question this post takes up is what the two artifacts together establish — and what they do not.
The Compounding Effect
The two failures compound in a specific way. Each closes a different escape route from the gap. Each leaves the other in place. The compounding is not additive; it is structural, in the sense that fixing either failure independently does not rescue the other. Both diagnoses can be sharpened, both methodologies can be improved, and the architectural absence underneath remains.
Fix coverage. Posture is still wrong.
Suppose the field shipped tomorrow what Post 01 asked for: coverage anchored to an architectural failure-mode taxonomy, with explicit reporting of which architectural assumptions were probed against the deployment’s specific composition pattern, and uncertainty disclosure where coverage is incomplete because the failure mode is hard to test. The Coverage Illusion is closed. The architect now has a coverage report that probes the right things and reports honestly on where the probing fell short.
The Posture Mirage is unchanged. The agent in production still operates against a discovered tool surface the manifest could not enumerate, generates trajectories the assessment could not list, enforces policy at a layer the audit cannot inspect, and faces a deployment surface that has already moved since the audit window closed. The fixed coverage report describes the test plan against the deployment as it existed at testing time. The deployment in production is not that deployment. Coverage at assessment time, however well-anchored, is not behavior at runtime.
Fix posture. Coverage is still wrong.
Suppose instead the field shipped what Post 02 pointed toward: runtime checking against architectural invariants, with deterministic enforcement at the protocol layer when invariants are violated. The Posture Mirage is closed. The architect now has runtime measurement of behavioral conformance to a stated policy.
The Coverage Illusion is unchanged. Runtime checking enforces the policy that was written. If the policy was written against a surface taxonomy and missed the architectural failure modes (composition failures, cross-agent execution paths, authorization layer placement, goal-trajectory drift), runtime enforcement faithfully enforces the policy as written, and that policy was the wrong one. VeriGuard’s own caveat is the cleanest illustration: the paper’s formal soundness depends on manual validation of the policy constraints. If the constraints were written against the surface taxonomy, the soundness guarantee is sound about the wrong thing. Runtime enforcement of an incomplete policy is not behavioral assurance against architectural failure modes.
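A minimal sketch of the point, under stated assumptions: each call carries a `tainted` flag marking data derived from untrusted retrieved content, the surface policy is a per-call pattern check, and the architectural policy is a single invariant about privileged tools. All names, patterns, and payloads are hypothetical; this is not VeriGuard's mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    caller: str    # agent issuing the call
    tool: str      # tool being invoked
    payload: str
    tainted: bool  # True if the payload derives from untrusted content


BLOCKED_PATTERNS = ("ignore previous instructions",)  # hypothetical surface rule
PRIVILEGED_TOOLS = {"payments.transfer"}              # hypothetical


def surface_policy(call: Call) -> bool:
    """Per-call rule written against known input patterns."""
    return not any(p in call.payload.lower() for p in BLOCKED_PATTERNS)


def architectural_policy(call: Call) -> bool:
    """Invariant: untrusted-derived data never reaches a privileged tool."""
    return not (call.tainted and call.tool in PRIVILEGED_TOOLS)


# A cross-agent execution path: agent_a retrieves content, delegates to agent_b.
path = [
    Call("agent_a", "web.fetch", "quarterly report", tainted=False),
    Call("agent_a", "agent_b.delegate", "pay the invoice named in the report", tainted=True),
    Call("agent_b", "payments.transfer", "invoice from report", tainted=True),
]

surface_ok = all(surface_policy(c) for c in path)
architectural_ok = all(architectural_policy(c) for c in path)
print(surface_ok, architectural_ok)  # True False
```

An enforcer running only `surface_policy` would admit every call on this path and be correct about the policy it was given. The violation is visible only to a policy that was written against the architecture.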
What’s Left on the Floor
The two failures together describe two zones of measurement and one absence between them.
The coverage zone is what the surface taxonomy reaches: enumerable input transformations probed by red teams, with reports against published category vocabularies (OWASP LLM Top 10, MITRE ATLAS, AgentDojo, ASB). Real failures live here. Real defenses are tested against them. Coverage in this zone is meaningful as a measurement of test design effort.
The posture zone is what the attestation describes: declared state at the audit moment, mapped against control frameworks (SOC 2, ISO 27001, NIST AI RMF, AIUC-1). Real risk-transfer in the contractual sense lives here. Real regulatory and procurement requirements are satisfied. Within scope, posture assessment produces accurate measurements of what was declared.
Between them sits a third zone: the architectural failure surface that escapes both. Cross-agent execution paths under runtime composition. Tool-call authorization bypass through model-layer enforcement. Goal-trajectory drift in agents whose paths cannot be enumerated in advance. Policy violations that emerge from transitive tool access through dynamically discovered MCP servers. Series 1 through Series 9 have catalogued these failure modes in detail; Series 12 documents the goal-oriented case in particular. The failures are real. They are also outside the surface taxonomy that coverage probes and outside the static posture artifact that attestation produces. No current vendor product reaches this zone.
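Transitive tool access is the easiest of these to illustrate. The sketch below assumes the runtime composition can be modeled as a directed graph in which an edge means "can invoke"; the node names and the `reachable` helper are hypothetical, and a real deployment would have to build this graph from live discovery rather than a literal.

```python
from collections import deque


def reachable(graph: dict[str, set[str]], start: str) -> set[str]:
    """Breadth-first search: everything invocable from `start`, directly or not."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical composition: only the first hop was declared at audit time.
graph = {
    "agent": {"mcp.search"},
    "mcp.search": {"mcp.files"},  # server discovered at runtime
    "mcp.files": {"fs.write"},
}
declared = {"mcp.search"}

undeclared = reachable(graph, "agent") - declared
print(sorted(undeclared))  # ['fs.write', 'mcp.files']
```

The agent's declared surface is one tool. Its effective surface is three, and the two undeclared ones appear in neither the coverage report nor the attestation.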
The enterprise architect reads the coverage report and the posture attestation and stands on what looks like solid measurement. Underneath both reports, the architectural failure surface is unmeasured. Not because the attestations were sloppy — they were not — but because each was organized around a different property of the system, and the properties that matter for runtime architectural failure are not what either was organized to measure. The “floor” the architect believes the reports establish is not there. The post’s title is the structural finding.
What Credible Measurement Would Require
Naming what is missing is the work this post is for. The architectural absence has a shape. Credible measurement of agentic AI security posture would have to satisfy three properties that the current measurement apparatus does not.
Behavioral, not declared. The artifact has to describe what the agent actually does under runtime composition — against discovered tool surfaces, against retrieved context, against goal specifications under load. Static-posture artifacts cannot do this work, not because they are flawed but because they were not designed for systems whose runtime behavior is generated rather than configured. The compliance frameworks (SOC 2, ISO 27001, AIUC-1) are doing exactly what they were designed to do; the design predates the object class being assessed. A credible behavioral artifact is not a faster-cadence version of a posture attestation. It is a different category of measurement.
Continuous in the runtime sense, not point-in-time. The measurement has to happen during execution: checking against architectural invariants as the agent runs, with deterministic enforcement when invariants are violated. This is not continuous control monitoring of declared state at higher cadence; it is runtime checking of behavior against architectural constraints. The early architectural templates exist: VeriGuard, Winston SMT, GuardAgent, AGrail. None is yet a vendor product the enterprise can procure alongside an audit engagement. The work is research; the gap to production scale is real.
Architecturally anchored, not behaviorally approximated. The invariants the measurement checks against must be architectural — tool-call mediation, authorization boundary placement, protocol-layer constraints, trajectory boundedness — not behavioral approximations checked probabilistically by another model. Probabilistic monitoring of probabilistic behavior compounds the uncertainty rather than resolving it. Architectural invariants are inspectable, deterministic, and auditable in the runtime sense. The structural-versus-probabilistic enforcement distinction and the protocol-layer enforcement frame are Luminity’s analytical scaffolding, developed across Series 1 through Series 9 and load-bearing here.
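As a minimal sketch of what deterministic checking against architectural invariants could look like, assume every tool call is forced through one mediation point that enforces two of the invariants named above: an authorization boundary (a fixed allowlist) and trajectory boundedness (a step limit). The `Mediator` class and its parameters are hypothetical and do not reproduce VeriGuard, GuardAgent, or any other cited system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


class InvariantViolation(Exception):
    """Raised when a call would break an architectural invariant."""


@dataclass
class Mediator:
    allowed_tools: frozenset[str]  # authorization boundary
    max_steps: int                 # trajectory bound
    steps: int = 0
    log: list[str] = field(default_factory=list)  # auditable record of admitted calls

    def call(self, tool: str, handler: Callable[..., Any], *args: Any) -> Any:
        # Checks are plain comparisons: same input, same verdict, every time.
        if self.steps >= self.max_steps:
            raise InvariantViolation(f"trajectory bound {self.max_steps} exceeded")
        if tool not in self.allowed_tools:
            raise InvariantViolation(f"tool {tool!r} is outside the authorization boundary")
        self.steps += 1
        self.log.append(tool)
        return handler(*args)


mediator = Mediator(allowed_tools=frozenset({"crm.read"}), max_steps=2)
result = mediator.call("crm.read", str.upper, "acme")

try:
    mediator.call("fs.write", print, "should never run")
    denied = False
except InvariantViolation:
    denied = True

print(result, denied, mediator.log)  # ACME True ['crm.read']
```

No model sits in the decision path, so the verdict is inspectable and repeatable. The sketch is only as good as the invariants it is given, which is the dependency the compounding argument describes.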
No vendor currently delivers any of the three properties at production scale. The work to deliver them is underway in research, in early-stage standards efforts, and in pilot architectures inside frontier labs. Series 11 will track the institutional progression. Today the floor is not there.
The Pairing — Measurement Gap and Detection Gap Are the Same Absence
The architectural absence Series 10 names from the measurement side is the same architectural absence Series 12 named from the detection side. The pairing is structural, and naming it directly is the gavel-down work this post is for.
Series 12 closed on the detection frontier. Detecting goal-oriented attacks requires stateful trajectory modeling against a formal representation of the authorized objective — neither exists at production scale. The series argued that goal hijacking is a categorically distinct attack class from prompt injection, that vertical stacks like Google’s deliver structural controls at the protocol boundary but cannot reach the goal layer below it, and that detection at the goal layer requires infrastructure the field has not built.
Series 10 closes on the measurement frontier. Measuring agentic AI security posture requires behavioral checking at runtime against architectural invariants — neither exists at production scale. The series has argued that coverage metrics measure the wrong thing, that posture assessments measure the wrong category of object, that the failures compound, and that closing the gap requires a measurement layer the field has not built.
Same architectural absence. Two angles. Detection asks how to see the failure when it happens. Measurement asks how to know the security posture is working in the first place. Both find no infrastructure where infrastructure is required. The field has two separate conversations — detection in one venue, measurement in another — and the missing layer behind both is the same. Naming the pairing changes the conversation: the question is no longer how to improve detection or how to improve measurement, but how to build the architectural enforcement layer that makes both possible.
The Honest Accounting
Series 10 does not deliver a product roadmap. There is no product. It does not deliver a vendor recommendation. There is no vendor that closes the gap. It does not deliver an audit checklist that fixes the methodology — the methodology is not what failed.
What it delivers is a structural diagnosis: the enterprise agentic AI security measurement apparatus has no architectural floor underneath it. Coverage probes the wrong taxonomy. Posture measures the wrong category of object. The two failures compound in a way that cannot be closed by better tooling, better cadence, or more rigorous procurement. Closing the gap requires a measurement layer that does not exist at production scale today, anchored to architectural enforcement rather than to declared controls, organized around behavior under runtime composition rather than around configuration at the audit moment.
The architect carrying the two artifacts forward into procurement is carrying real artifacts. The artifacts also do not establish what the procurement process treats them as establishing. The honest accounting is that the floor is not there, and the work to build it has not finished. Series 11, The Standards Layer, picks up next, tracking whether institutional standards bodies (OWASP ASI, NIST 800-53 agentic overlays, NCCoE agent identity work, protocol-level standards initiatives) are building the architectural floor that Series 10 names as absent. The diagnosis closes here. The institutional progression starts there.
