The Instruction Illusion: Why CLAUDE.md Files Cannot Govern Agent Behavior

A January 2026 systematization of knowledge paper synthesizing findings from 78 studies spanning 2021 through 2026 concluded that attack success rates against state-of-the-art defenses exceed 85% when adaptive attack strategies are employed against agentic coding assistants. The paper documented concrete exploit chains targeting skill-based architectures — vulnerabilities not previously disclosed in the public literature. The pattern is consistent: instruction files designed to constrain agent behavior are being bypassed at the execution substrate layer, where no configuration file has authority.

There is a particular kind of false confidence in enterprise AI deployments today. Organizations configure their agents carefully — they write CLAUDE.md files, establish system prompts, define behavioral boundaries, and document the rules of engagement. Then they deploy. And they assume the rules will hold. What the security research now makes clear is that this assumption rests on a fundamental architectural misunderstanding: configuration-layer instructions and runtime-layer governance are not the same thing, and confusing the two is creating exploitable gaps at production scale.

CLAUDE.md is a persistent instruction file used with Claude Code that provides behavioral rules, project context, and operational guidelines to an agent working within a codebase. It is a powerful and well-designed mechanism for its intended purpose. The problem is not the file itself. The problem is what happens when an agent reads other files — repository READMEs, code comments, external documents, API responses, web content — as part of executing a task. Any of those sources can contain adversarial instructions. And when they do, those instructions compete with the CLAUDE.md directives for authority over the agent’s next action. The competition is not always resolved in favor of the configuration file.

85%+

Attack success rates against current defenses when adaptive injection strategies are used against agentic coding assistants. A 2026 meta-analysis of 78 studies found that just five carefully crafted documents can manipulate AI agent responses up to 90% of the time through retrieval-augmented contexts. Prompt injection has remained the OWASP LLM Top 10’s number-one risk since the list’s inception.

What the Research Actually Shows

The OWASP Top 10 for LLM Applications 2025 designates prompt injection — and specifically its indirect variant — as LLM01:2025, the primary threat to large language model deployments. The classification distinguishes between direct injection, where a user explicitly attempts to override model behavior, and indirect injection, where the override arrives embedded in content the agent processes during normal task execution. The indirect variant is the one that undermines instruction files, because it does not require an adversary to interact with the agent directly. It only requires that the agent, doing its job, reads something the adversary has influenced.

Researchers from Secure Code Warrior demonstrated this in April 2025 against Claude Code specifically: when instructed to perform a task without prior context, the agent naturally looks for documentation files such as READMEs to build up situational context. A malicious README containing embedded instructions was sufficient to redirect agent behavior. The researchers also discovered that rules files — the persistent configuration layer analogous to CLAUDE.md — represent an ideal persistence target: once an injection succeeds in writing to a rules file, the adversarial instruction survives across sessions.

The coercion of human-in-the-loop agentic actions and fully autonomous agentic workflows are the new attack vector for hackers. The attackers used AI to carry out 80–90% of the operation — reconnaissance, exploit development, credential harvesting, lateral movement, and data exfiltration — with humans stepping in only at a handful of key decision points.

— MIT Technology Review — “Rules fail at the prompt, succeed at the boundary,” January 28, 2026

The MIT Technology Review’s January 2026 analysis of a documented state-sponsored espionage campaign captures what adversarial exploitation of agentic systems looks like at operational scale. The attack leveraged Claude Code combined with Model Context Protocol (MCP) tool access. The agent was not hacked in any technical sense — it was persuaded. The attack succeeded by decomposing a sophisticated operation into small, individually plausible-looking tasks, and by establishing a fictional authorization context that the agent accepted as legitimate. No CLAUDE.md file would have stopped it, because the attack operated at the execution layer, not the configuration layer.

Three Injection Vectors That Bypass Instruction Files

Understanding why configuration files are insufficient requires understanding where injection actually occurs. The research identifies three structurally distinct attack surfaces, each capable of overriding instruction-layer governance. In practice, production agents are exposed to all three simultaneously.

Attack Surface

Environmental Content Injection

The agent reads files it encounters during task execution — READMEs, documentation, code comments, configuration files, commit messages, issue descriptions. Any of these can contain embedded adversarial instructions invisible to users but processed by the model. The OWASP LLM Prompt Injection Prevention Cheat Sheet identifies code comments and documentation analyzed by AI coding assistants as primary indirect injection vectors.

Why Config Files Cannot Address This

Context Is Not Instruction

Instruction files establish behavioral intent at session initialization. Environmental content enters the reasoning loop continuously during execution. The model processes both as natural language — it has no architectural mechanism to distinguish “this text is my operating instruction” from “this text is data I am analyzing.” Separation of instruction and data planes requires infrastructure, not configuration.

OWASP LLM01:2025 — Prompt Injection, genai.owasp.org

Attack Surface

Tool Poisoning via MCP

Model Context Protocol enables agents to access external tools, data sources, and services. A poisoned MCP tool can embed instructions in its response that redirect agent behavior — credential harvesting, lateral movement, exfiltration — framed as legitimate tool output. Invariant Labs demonstrated this class of attack, with tool descriptions containing hidden directives that were processed as authoritative by the agent.

Why Config Files Cannot Address This

No Independent Policy Layer

The MIT Technology Review analysis identified the root cause precisely: there was no independent policy layer saying “this environment may only perform actions against assets labeled internal.” CLAUDE.md cannot enforce tool-level scope restrictions at runtime. That requires a harness with MCP-aware access controls, tool output sanitization, and behavioral anomaly detection — infrastructure operating between the tool and the model’s reasoning context.

MIT Technology Review — January 28, 2026

Attack Surface

Rules File Persistence & Config Manipulation

CVE-2025-53773, documented in GitHub Copilot, demonstrated that a prompt injection payload delivered through a GitHub issue could instruct the agent to modify its own configuration files, establishing persistent autoApproval of subsequent agent actions. The same attack class applies to any agent with write access to its own rules or configuration directory — which includes agents whose CLAUDE.md is located within the working directory they are authorized to modify.

Why Config Files Cannot Address This

Configuration Is a Write Target

When an agent has file system access to the directory containing its instruction files, those files become part of the attack surface rather than part of the defense. CVE-2025-54794 and CVE-2025-54795 demonstrated command injection through Claude Code’s whitelisted command handling — illustrating that the boundary between permitted and restricted actions can be crossed without triggering configuration-layer guardrails.

Cymulate — CVE-2025-54794 & CVE-2025-54795, August 2025

The Architectural Root Cause

The research converges on a single structural explanation for why prompt injection remains unsolved despite years of awareness. OWASP states it directly: prompt injection vulnerabilities exist because LLMs cannot reliably separate instructions from data. The model processes both in natural language. It has no native capacity to distinguish between “this is my operating directive” and “this is content I am analyzing.” Every piece of text in the context window competes for interpretive weight, and the competition is settled probabilistically — not architecturally.

This means the model itself cannot be the enforcement boundary. Anthropic’s own research on Claude Opus 4.5 — which demonstrated meaningful improvement in browser-use prompt injection resistance through reinforcement learning and classifier-based detection — makes the same point from the opposite direction: even state-of-the-art defenses built into the model’s training leave meaningful residual risk, and Anthropic acknowledges that prompt injection is far from a solved problem, particularly as agents take more real-world actions. The improvements are real. The gap remains significant.

The Core Distinction Enterprise Architects Must Make

Instructions tell an agent what to do. Governance controls what the agent is able to do. CLAUDE.md and system prompts operate at the instruction layer. Runtime access controls, input provenance tracking, context boundary enforcement, and behavioral anomaly detection operate at the governance layer. Enterprises that conflate these two layers are building agents with instructions but without governance — and the research documents precisely what that looks like under adversarial conditions.

What Governance at the Harness Layer Requires

The NIST AI Risk Management Framework’s Generative AI Profile (AI 600-1), released in July 2024, establishes that generative AI systems require continuous risk management across their operational lifecycle — not point-in-time configuration. The framework’s GOVERN, MAP, MEASURE, and MANAGE functions all presuppose ongoing runtime oversight, not static instruction files. Applied to agentic systems operating in enterprise environments, this translates into four infrastructure requirements that no configuration file can fulfill.

1. Input Provenance Tracking

Every piece of content that enters an agent’s reasoning context should carry provenance metadata: where it came from, when it was retrieved, what trust tier the source belongs to. Environmental content — files, API responses, web fetches — should be tagged as untrusted prior to processing. Trusted system instructions, by contrast, should arrive through a sealed channel that cannot be overwritten by environmental content. This is not prompt engineering. It is infrastructure design, and it must be enforced at the harness layer before content reaches the model.

2. Context Boundary Enforcement

The OWASP LLM Prompt Injection Prevention Cheat Sheet recommends separating and clearly denoting untrusted content to limit its influence on user prompts. In practice, this requires the harness to maintain distinct context partitions — system instruction context versus environmental data context — and to enforce that environmental content cannot be interpreted at the instruction tier. This is an architectural constraint, not a prompting strategy. It must be implemented in the infrastructure that mediates between external content and the model’s context window.

3. Behavioral Anomaly Detection

A successful injection typically manifests as a deviation from expected task scope — the agent attempts an action not implied by its original directive, accesses a resource outside its declared operating domain, or produces output inconsistent with its behavioral baseline. Runtime monitoring that tracks what the agent is doing against what it was asked to do provides an independent detection layer that does not depend on the model’s own judgment about whether it is being manipulated. The CSA’s MAESTRO framework for agentic AI threat modeling, released in February 2025, specifically addresses this class of monitoring as a foundational control for autonomous reasoning systems.

4. Least-Privilege Tool Access with Independent Policy Enforcement

Tool access granted to an agent through MCP or other integration mechanisms should be governed by an independent policy layer that the agent itself cannot modify or reason around. The MIT Technology Review analysis identified the absence of this layer as the critical gap in the compromised agentic deployment: there was nothing between the agent’s decision to use a tool and the tool’s execution that enforced scope restrictions. Least-privilege access, tool output sanitization, and action confirmation thresholds must be enforced by the harness, not requested of the model.

The Prevailing Practice

Configuration-Layer Governance

Write comprehensive CLAUDE.md files. Define behavioral rules in system prompts. Trust the model to apply constraints during task execution. Assume that well-written instructions will hold under adversarial environmental conditions.

Result: Instruction files become the target rather than the boundary. Injection through environmental content, tool responses, and configuration file manipulation succeeds at rates exceeding 85% under adaptive attack strategies.

Instructions Without Governance

The Required Architecture

Harness-Layer Governance

Enforce input provenance tracking before content reaches the model. Maintain context boundary separation between system instructions and environmental data. Monitor agent behavior against task scope baselines. Enforce tool access through independent policy layers the agent cannot modify.

Result: Injection attempts are detected at the infrastructure boundary rather than delegated to model judgment. Adversarial instructions from environmental content cannot reach the instruction-tier context.

Runtime Enforcement

The Harness as the Actual Defense Surface

What the research makes clear is that the security of agentic AI systems is not determined by which foundation model is deployed or how carefully its instruction files are written. It is determined by the infrastructure surrounding the model — the layer that controls what enters the context window, what tools are accessible under what conditions, what behavioral deviations trigger intervention, and what actions require explicit authorization before execution.

This framing aligns with the emerging NIST approach to AI security controls: the Cybersecurity Framework Profile for AI (NIST IR 8596, preliminary draft 2025) positions AI security as a continuous operational discipline spanning GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, and RECOVER functions — a lifecycle framework, not a configuration checklist. Applying those functions to agentic systems means treating the harness as the primary enforcement mechanism and the model as an untrusted component operating within it.

Given the stochastic nature of generative AI, it is unclear if there are foolproof methods of prevention for prompt injection. However, the following measures can mitigate the impact. Separate and clearly denote untrusted content to limit its influence on user prompts. Conduct adversarial testing and attack simulations.

— OWASP Top 10 for LLM Applications 2025

OWASP’s acknowledgment that the model itself cannot fully solve this problem is not a counsel of despair. It is a precise statement about where the responsibility lies: not in better prompting, but in better infrastructure. The organizations that understand this distinction are building harnesses with provenance-aware context management, behavioral monitoring, and independent policy enforcement. The organizations that do not understand it are writing longer CLAUDE.md files and calling it governance.

What Genuine Agentic Security Requires in Practice

First, treat every source of environmental content — files, APIs, web fetches, tool responses — as untrusted by default, and enforce that classification at the infrastructure layer before content reaches the model. Second, maintain architectural separation between system instruction context and environmental data context. Third, implement behavioral monitoring that detects scope deviation independent of the model’s self-assessment. Fourth, enforce tool access through an independent policy layer that the agent cannot reason around, modify, or invoke outside its declared scope. Fifth, conduct adversarial testing against environmental injection vectors before production deployment — not after. The model is downstream of all five controls. That is where the engineering work of 2026 actually lives.

Practitioner Takeaway

A CLAUDE.md file is an instruction. An Agent Harness is a governance system. The distinction is not semantic — it is the difference between telling an agent what to do and controlling what an agent is able to do. Every enterprise deploying agentic AI into production environments with file system access, external tool integrations, or MCP connections needs to build at the governance layer, not only at the instruction layer. The research documents, at operational scale, what happens when organizations get this wrong. The blueprint for getting it right is in the infrastructure.

OWASP LLM01:2025 — Prompt injection remains the top LLM vulnerability, with the indirect variant explicitly identified as the primary risk for agentic systems

arXiv SoK (January 2026) — Meta-analysis of 78 studies finds adaptive attack success rates exceeding 85% against current defenses

Secure Code Warrior (April 2025) — Demonstrated CLAUDE.md bypass via README injection and rules file persistence in live Claude Code testing

MIT Technology Review (January 2026) — Documented state-sponsored espionage campaign using agentic Claude Code plus MCP; 80–90% of operation automated

Cymulate (August 2025) — CVE-2025-54794 and CVE-2025-54795: command injection bypassing Claude Code’s whitelisted command handling without user confirmation

NIST AI 600-1 — Generative AI Risk Management Profile: requires continuous operational risk management across the AI lifecycle, not point-in-time configuration

NIST IR 8596 — Cybersecurity Framework Profile for AI: establishes GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, RECOVER functions for AI systems

CSA MAESTRO (February 2025) — Agentic AI threat modeling framework addressing autonomy-driven risks including prompt-to-action vulnerabilities

The Instruction Illusion: Why CLAUDE.md Files Cannot Govern Agent Behavior

What the Research Actually Shows

Three Injection Vectors That Bypass Instruction Files

Environmental Content Injection

Context Is Not Instruction

Tool Poisoning via MCP

No Independent Policy Layer

Rules File Persistence & Config Manipulation

Configuration Is a Write Target

The Architectural Root Cause

The Core Distinction Enterprise Architects Must Make

What Governance at the Harness Layer Requires

1. Input Provenance Tracking

2. Context Boundary Enforcement

3. Behavioral Anomaly Detection

4. Least-Privilege Tool Access with Independent Policy Enforcement

Configuration-Layer Governance

Harness-Layer Governance

The Harness as the Actual Defense Surface

What Genuine Agentic Security Requires in Practice

The Instruction Illusion — March 2026

Share this: