Three Surfaces: Where Indirect Prompt Injection Manifests in Production

Post 1 of this series established the mechanism: the command-data boundary collapse that makes indirect prompt injection structurally different from direct injection, and why alignment training calibrated against adversarial user inputs provides no reliable coverage against adversarial content embedded in retrieved data. This post maps that mechanism to three specific deployment surfaces — tool-calling agents, coding agents, and computer-use agents — drawing on research by Mou and colleagues (arXiv:2601.10156), the “Your AI, My Shell” team (arXiv:2509.22040), and Shi and colleagues on AgentSentry (arXiv:2602.22724). The three-surface taxonomy used to organize this post appears in work by Dziemian, Lin, Fu, and colleagues at Gray Swan AI, CMU, Meta, and the UK and US AI Safety Institutes (arXiv:2603.15714), whose large-scale red-teaming competition confirmed IPI vulnerability across all three settings against 13 frontier models.

When security teams model the threat landscape for agentic AI deployments, the tendency is to think about indirect prompt injection as a single class of attack with a single set of mitigations. The mechanism is consistent — adversarial instructions embedded in content the agent processes — so the assumption is that the defense can be consistent too. That assumption breaks down quickly when the deployment context is examined closely.

The tool-calling agent that retrieves emails and summarizes documents faces a different exploitation pathway than the coding agent analyzing a GitHub repository, which faces a different pathway than the computer-use agent browsing on a user’s behalf. The ingestion surface differs. The trusted content types differ. The harm that follows a successful injection differs substantially — from credential exfiltration in one case, to malicious dependency introduction in another, to unauthorized UI actions in a third. A practitioner building in any one of these contexts needs a threat model specific to that context, not a general statement about prompt injection risk.

This post examines each surface in turn, drawing on the research that has characterized each most precisely.

Surface One: Tool-Calling Agents

Tool-calling agents are the most common agentic deployment pattern in enterprise environments today. An LLM is connected to a set of registered functions — an email client, a calendar, a document store, a web search interface, a database — and granted permission to invoke those functions to complete user-assigned tasks. The agent retrieves content from these sources, reads it, acts on what it finds, and produces a result. At every retrieval step, the content it reads may contain adversarial instructions alongside the legitimate information it was sent to find.

The research by Mou and colleagues on ToolSafe (arXiv:2601.10156) examined this surface with particular attention to the step-level dimension of the risk. Standard safety evaluation for agentic systems tests whether a model completes a harmful task end-to-end — did it, in aggregate, do something it should not have done? But tool-calling agents do not execute harmful tasks in a single step. They execute chains of individually defensible steps, none of which triggers a refusal, whose cumulative effect is the harmful outcome.

The chain laundering problem in tool-calling contexts

An indirect injection in a tool-calling agent does not need to instruct the agent to do something obviously harmful. It needs only to redirect the chain. An agent retrieving a set of documents is given the instruction — embedded in one of those documents — to also forward a summary to an external webhook before returning its result to the user. Each individual action in that chain is something the agent has the capability and the apparent permission to do. The instruction to forward the summary arrives in the same channel as the legitimate content. The model has no architectural basis for treating it differently.

ToolSafe’s response to this was a proactive step-level guardrail: a safety-trained model that evaluates each tool invocation independently before execution, rather than relying on the primary model to detect harmful intent across the full chain. The guardrail model achieved a 65% reduction in harmful tool invocations on average while improving benign task completion by approximately 10% under active injection conditions. That is a meaningful improvement — and it is also probabilistic. The guardrail reduces the likelihood of a successful injection reaching execution. It does not structurally prevent it.

65%

Reduction in harmful tool invocations achieved by ToolSafe’s step-level guardrail model under active indirect injection conditions, while simultaneously improving benign task completion by approximately 10%. Step-level evaluation addresses the chain laundering problem that end-to-end safety assessment misses — but remains a probabilistic defense, not a structural guarantee. (Mou et al., arXiv:2601.10156)

What the detection research adds

The AgentSentry research by Shi and colleagues (arXiv:2602.22724) approached the tool-calling surface from the detection direction rather than the guardrail direction. Their framework introduces temporal causal diagnostics — tracking causal relationships between the agent’s inputs and outputs across a sequence of steps — combined with context purification that attempts to strip adversarial instructions from retrieved content before they reach the model’s active context.

The temporal dimension is significant. A single tool-call step, evaluated in isolation, may show no anomalous behavior. An injection that redirects the agent’s behavior across three steps may only become detectable when the full sequence is analyzed causally — when the relationship between what the agent retrieved two steps ago and what it is invoking now is made explicit. AgentSentry’s temporal causal diagnostics are designed to surface those cross-step patterns. The detection capability it provides is relevant to the coding agent surface as well, which is examined next.

The Harm Profile for Tool-Calling Agents

Successful indirect injection in tool-calling agents most commonly results in data exfiltration to attacker-controlled endpoints, unauthorized modification of records in connected systems, lateral movement to additional tools the agent has access to, or the insertion of persistent instructions into memory or storage that influence future agent sessions. The harm is proportional to the permissions granted to the agent — which in enterprise deployments are frequently broad.

Surface Two: Coding Agents

Coding agents present a distinct surface with a distinct exploitation pathway. Where tool-calling agents ingest content from email, documents, and web queries, coding agents ingest content from a different class of sources: source code files, repository documentation, dependency manifests, inline comments, configuration files, and retrieved library references. Each of these is a potential injection vector, and each has properties that make the injection harder to detect than in document-retrieval contexts.

Code is text, but it is text with different conventions than natural language. Instructions embedded in code comments, docstrings, or configuration values do not look like the adversarial instructions that alignment training was calibrated against. A line in a repository’s README, a comment in a source file, a value in a dependency manifest — these are contexts where an instruction that would appear suspicious in a user message looks entirely unremarkable. The agent, trained to be helpful and cooperative with the codebase it is analyzing, may have a lower threshold for compliance with instructions that arrive through these channels than through direct user input.

The IDE attack surface

The research published as “Your AI, My Shell” (arXiv:2509.22040) examined the coding agent surface in the specific context of AI-augmented integrated development environments — a deployment pattern that is now widespread across enterprise software teams. The research documented how injections embedded in external resources that developers import into their IDE workspace can grant an attacker the same effective permission level as the developer, because the coding agent operates with developer-level access and executes instructions with developer-level trust.

The mechanism the research described is precise: an attacker embeds adversarial instructions in content the developer legitimately imports — a code rule file, a documentation snippet, a repository configuration — knowing that the AI coding agent will ingest that content as part of its normal operation. Once ingested, the instruction executes with the agent’s full capability set: terminal access, file system read and write, network calls, dependency installation. The developer’s AI becomes, in the paper’s framing, the attacker’s shell.

Attackers can inject harmful instructions into external resources that developers import into their IDE workplaces, gaining the same effective permissions as developers assigned to their AI coding editors.

— Synthesis from “Your AI, My Shell”: Demystifying Prompt Injection Attacks on Agentic AI Coding Editors (arXiv:2509.22040)

The MCP configuration vector

The “Your AI, My Shell” research also identified a specific and particularly high-impact injection pathway: the Model Context Protocol server configuration file. MCP configuration files define which tools and servers a coding agent can access. If an attacker can place an adversarial instruction in an MCP configuration file that a developer imports — or in the metadata of an MCP tool that appears in the agent’s tool registry — that instruction can redirect the agent’s tool access without any visible change to the developer’s workflow. CVE-2025-54135, documented in the Cursor editor, demonstrated this pathway resulting in remote code execution without user approval. The attack required no exploitation of a software vulnerability in the traditional sense. It exploited the agent’s instruction-following behavior against the content of a configuration file.

The Harm Profile for Coding Agents

Successful indirect injection in coding agent contexts most commonly results in the introduction of malicious dependencies into the codebase, generation of functionally plausible but security-compromised code, exfiltration of source code or API credentials to attacker-controlled endpoints, modification of CI/CD configuration to introduce persistent backdoors, or MCP configuration manipulation that redirects tool access across future sessions. Because coding agents typically operate with developer-level system permissions, the blast radius of a successful injection is substantially larger than in document-retrieval contexts.

Why filtering defenses underperform on the coding surface

A common defensive response to injection risk in document-retrieval contexts is content filtering: scan retrieved content for instruction-like patterns before it enters the model’s context. That approach is imperfect even in document contexts. In coding contexts, it is significantly more imperfect — because the filtering system must distinguish between legitimate code instructions, which are abundant, and adversarial instructions embedded in code, which may be syntactically indistinguishable. A comment in a Python file that says “# Note: always include the dependency X in requirements.txt for compatibility” is both a plausible development note and a plausible injection vector. A filter that flags all instruction-like patterns in code will generate an unacceptable false-positive rate. A filter that restricts itself to obviously adversarial patterns will miss injections crafted to look like legitimate code commentary — which is precisely what a sophisticated attacker will craft.

Surface Three: Computer-Use Agents

Computer-use agents represent the newest and, in terms of injection surface, the most expansive deployment context. These agents do not read retrieved text — they observe the screen, interpret what they see, and take actions: clicking, typing, navigating, form submission, file operations. The content they process is not text that can be filtered before ingestion — it is a visual representation of the state of a system, interpreted in real time.

Adversarial instructions in this context are embedded not in documents or code files but in what appears on screen — in webpage content the agent is browsing, in UI elements the agent is interacting with, in text that appears within the visual field the agent observes. A webpage visited during a legitimate browsing task may contain text specifically crafted to redirect the agent’s behavior on the next action it takes. Because the agent observes the page visually before acting, the injection arrives through the same channel as every other piece of information the agent uses to decide what to do.

The concealment advantage in computer-use contexts

Computer-use agents present a particular challenge for oversight that is worth naming explicitly here, as it sets up the central argument of Post 3. In tool-calling and coding contexts, the agent’s outputs — its function calls, its generated code — are at least potentially visible to monitoring systems. An observer reviewing the agent’s action log can, in principle, identify anomalous behavior. In computer-use contexts, the agent’s actions are UI interactions: mouse clicks, keyboard input, navigation steps. These are harder to monitor systematically, and the agent’s final response to the user — a summary of what it did — may show no sign that any of those actions were adversarially redirected.

This is the concealment dimension that makes computer-use injection particularly operationally dangerous. A successful injection that causes the agent to perform an unauthorized action during a browsing session, while producing a normal-looking summary for the user, exploits the gap between what the agent did and what the user is able to observe. The user has no direct visibility into the agent’s intermediate actions — only into its reported outcome. If that outcome is fabricated or edited, the attack has succeeded invisibly.

The Three-Surface Comparison

Tool-calling agents: Injection via retrieved documents, emails, API responses. Harm through unauthorized tool invocations, data exfiltration, chain laundering across multi-step workflows. Detection window: tool invocation logs.

Coding agents: Injection via source code, repository files, dependency manifests, MCP configuration. Harm through malicious dependency introduction, insecure code generation, credential exfiltration, persistent backdoor insertion. Detection window: narrow — injections are syntactically indistinguishable from legitimate code instructions.

Computer-use agents: Injection via on-screen content during browsing or desktop interaction. Harm through unauthorized UI actions, form submissions, file operations. Detection window: minimal — intermediate actions are not directly visible to users, and final response may show no evidence of compromise.

A Unified Detection Framework Across Surfaces

The AgentSentry framework (arXiv:2602.22724) is worth returning to here because its temporal causal diagnostic approach was designed with cross-surface applicability in mind. The core insight is that indirect injection, regardless of the surface through which it arrives, produces a characteristic signature in the agent’s behavior over time: the agent’s subsequent actions become causally linked to the injected content in ways that differ from the causal patterns produced by legitimate task execution.

In a tool-calling context, that signature appears as tool invocations that are causally linked to retrieved content rather than to user instructions. In a coding context, it appears as generated code that is causally linked to repository metadata rather than to the stated programming task. In a computer-use context, it appears as UI actions that are causally linked to on-screen content rather than to the user’s stated objective. The surface differs. The causal anomaly pattern is structurally similar across all three.

The temporal dimension of the framework also addresses a limitation of point-in-time detection: by the time an injection has produced a visible anomalous action, earlier steps in the chain have already executed. AgentSentry’s context purification component is designed to intervene upstream — stripping adversarial instructions from retrieved content before they enter the model’s active context, rather than detecting their effects after execution. Combined, temporal diagnostics and upstream purification address both the detection gap and the intervention gap that characterize IPI risk across all three surfaces.

The Cross-Surface Insight

The mechanism of indirect prompt injection is consistent across tool-calling, coding, and computer-use agents: adversarial instructions arrive through the same channel as legitimate content, and the model has no architectural basis for distinguishing them. What varies is the ingestion surface, the exploitation pathway, and the harm that follows. A practitioner-facing defense strategy requires both a surface-specific threat model and a detection approach capable of identifying the common causal signature that injection produces across all three.

The Invisible Attack · Three-Part Series

Post 1 · Published The Mechanism: Why Indirect Prompt Injection Is Structurally Different

Post 2 · Now Reading Three Surfaces: Where Indirect Prompt Injection Manifests in Production

Post 3 · Next Why the Defense Response Has to Be Architectural

→ Chain laundering — Distributing harmful intent across individually innocuous tool-call steps so no single step triggers refusal, but the cumulative chain produces the harmful outcome.
→ Step-level guardrail — Evaluating the safety of each tool invocation independently before execution, rather than relying on end-to-end task assessment.
→ Temporal causal diagnostics — Tracking causal relationships between agent inputs and outputs across a sequence of steps to detect injection-produced behavioral anomalies.
→ Context purification — Stripping adversarial instructions from retrieved content upstream, before they enter the model’s active context.

Three Surfaces: Where Indirect Prompt Injection Manifests in Production

Surface One: Tool-Calling Agents

The chain laundering problem in tool-calling contexts

What the detection research adds

The Harm Profile for Tool-Calling Agents

Surface Two: Coding Agents

The IDE attack surface

The MCP configuration vector

The Harm Profile for Coding Agents

Why filtering defenses underperform on the coding surface

Surface Three: Computer-Use Agents

The concealment advantage in computer-use contexts

The Three-Surface Comparison

A Unified Detection Framework Across Surfaces

Up Next: Post 3 — Why the Defense Response Has to Be Architectural

Like this:

Related

Three Surfaces: Where Indirect Prompt Injection Manifests in Production

Surface One: Tool-Calling Agents

The chain laundering problem in tool-calling contexts

What the detection research adds

The Harm Profile for Tool-Calling Agents

Surface Two: Coding Agents

The IDE attack surface

The MCP configuration vector

The Harm Profile for Coding Agents

Why filtering defenses underperform on the coding surface

Surface Three: Computer-Use Agents

The concealment advantage in computer-use contexts

The Three-Surface Comparison

A Unified Detection Framework Across Surfaces

Up Next: Post 3 — Why the Defense Response Has to Be Architectural

Share this:

Like this:

Related