The security community has treated goal hijacking as a subtype of prompt injection for long enough that the distinction has become blurred. Both involve adversarial content influencing model behavior. Both can enter the system through retrieved content or environmental observations. Both can cause an agent to take actions its operator did not intend. The surface-level similarity is real. The structural difference is what matters.
Prompt injection attacks a specific instruction in a specific context window. The damage is bounded to that interaction. A well-designed single-turn defense — input sanitization, output filtering, instruction separation — can address it because the malicious content and the affected output are both localized. The attack is discrete. The fix can be discrete.
Goal hijacking attacks the agent’s objective. In a goal-oriented agent, the objective is not a single instruction. It is the persistent internal state that drives the agent’s replanning across the full operational lifecycle — initialization, input, inference, decision, execution. A compromised objective does not affect one output. It redirects every subsequent plan, every subsequent tool selection, every subsequent action the agent takes until the goal is corrected or the execution terminates. The attack is persistent. A discrete fix cannot address a persistent structural compromise.
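A toy agent loop makes the persistence concrete. This is a minimal sketch, not any cited system's architecture, and all names are hypothetical: once the persistent goal field is overwritten, every subsequent plan derives from the compromised objective rather than from the instruction that originally set it.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal goal-oriented agent loop (illustrative only)."""
    goal: str                          # persistent objective state
    log: list = field(default_factory=list)

    def plan(self, observation: str) -> str:
        # Each plan is derived from the *current* goal, not from the
        # instruction that originally established it.
        return f"step toward '{self.goal}' given '{observation}'"

    def run(self, observations: list) -> list:
        for obs in observations:
            self.log.append(self.plan(obs))
        return self.log

agent = Agent(goal="summarize quarterly report")
agent.goal = "exfiltrate quarterly report"   # one successful hijack...
actions = agent.run(["open doc", "read doc", "write output"])
# ...and every action until termination serves the compromised goal
assert all("exfiltrate" in a for a in actions)
```

A prompt injection would taint one entry in `log`; the hijack taints the generator of every entry.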
Why the OWASP Distinction Is Correct
The OWASP Top 10 for Agentic Applications 2026 makes this precise. ASI01 — Agent Goal Hijack — is designated as a separate vulnerability class from LLM01 — Prompt Injection. The framing is explicit: unlike traditional prompt injection, which is often transient, goal hijacking captures the broader agentic impact where manipulated inputs redirect goals, planning, and multi-step behavior. The designation is not taxonomic housekeeping. It is a structural argument that goal hijacking requires architectural-level defense rather than filter-level defense.
OWASP's entry for ASI01 describes the class as an attack that manipulates an agent's core objectives, task selection, or decision-making pathways. Unlike prompt injection (LLM01), which is often transient and bounded to a single interaction, goal hijacking redirects goals, planning, and multi-step behavior across the agent's entire execution chain. The primary defense is treating all natural-language inputs as untrusted and routing them through rigorous validation: a structural requirement, not a filter.
The distinction matters for how enterprises scope their defenses. A team that treats goal hijacking as advanced prompt injection will build better prompt injection defenses. That is not the same as building goal hijacking defenses. The threat model is different. The persistence is different. The propagation path is different. The measurement instruments are different.
Long-Horizon Attacks and the Failure of Single-Turn Defenses
Jiang, Wang, Liang, and Wang at Stony Brook (arXiv:2602.16901) demonstrate this empirically through AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks. The five attack families AgentLAB covers — intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning — are not variants of a single prompt injection class. They are structurally distinct attack types that exploit different properties of extended user-agent-environment interactions.
The benchmark comprises 644 test cases across 28 realistic agentic environments, spanning 10 risk categories. The empirical finding: agents remain highly susceptible to long-horizon attacks, and defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. Single-turn defenses address single-turn attacks. They do not address attacks that operate across multiple turns and exploit the agent's goal persistence.
The objective drifting attack family is particularly revealing. It embeds objective-shifting content within environmental observations over extended interactions, gradually shifting the agent’s goal from a benign task to a malicious one. Individual injections appear benign — no single observation triggers a pattern-matching defense. The attack operates across time, accumulating influence on the agent’s persistent goal state. This is not prompt injection at scale. It is a fundamentally different attack that exploits a property — goal persistence — that scripted agents do not have.
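The accumulation dynamic can be sketched numerically. In this illustrative model (the thresholds and pull values are invented for the example, not drawn from the benchmark), each observation carries a small adversarial pull that stays below a per-message filter's trigger, while the cumulative sum crosses the point at which the objective has effectively flipped:

```python
# Hypothetical numbers: a per-message filter flags any single
# observation with adversarial pull above 0.3; the goal is considered
# flipped once accumulated drift exceeds 1.0.
PER_MESSAGE_THRESHOLD = 0.3
DRIFT_LIMIT = 1.0

observation_pulls = [0.15, 0.2, 0.1, 0.25, 0.2, 0.15]

flagged = [p for p in observation_pulls if p > PER_MESSAGE_THRESHOLD]
total_drift = sum(observation_pulls)

assert flagged == []            # no single observation looks malicious
assert total_drift > DRIFT_LIMIT  # yet the objective has drifted past repair
```

This is why pattern-matching on individual inputs cannot catch objective drifting: the signal exists only in the trajectory, never in any one message.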
Intent hijacking operates similarly: adversarial content progressively erodes safety guardrails through multi-turn social engineering, crafting contextual framings that gradually redirect the agent into executing a malicious task. The attack targets agent actions, not content generation. The affected output is not the immediate response — it is the tool calls the agent makes, the resources it accesses, the actions it takes on behalf of a user who believes it is still pursuing their original objective.
The Propagation Asymmetry
The structural difference between prompt injection and goal hijacking can be stated precisely in terms of propagation. A successful prompt injection affects the outputs that derive from the tampered instruction. A successful goal hijacking affects every output the agent produces until the goal corruption is detected and corrected. In a long-horizon agent, that can mean dozens or hundreds of tool calls, memory updates, and external actions, all executed under a compromised objective that the agent is pursuing in good faith.
Prompt injection:

- Attacks a specific instruction in a specific context
- Damage bounded to the affected interaction
- Single-turn propagation
- Discrete malicious input, discrete affected output
- Pattern-matching detection viable
- Filter-level defense can address it

Goal hijacking:

- Attacks the agent’s persistent objective state
- Propagates across all subsequent planning and action
- Multi-turn, lifecycle-spanning propagation
- Objective-level compromise, not instruction-level
- Individual inputs may appear benign — no single trigger
- Architectural-level defense required
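The asymmetry reduces to a counting argument. In this hedged sketch (the figures are illustrative, not measured), injection damage is bounded by the outputs derived from one tampered instruction, while hijack damage scales with every action taken between compromise and detection:

```python
def blast_radius_prompt_injection(derived_outputs: int) -> int:
    # Damage is bounded to outputs derived from the tampered
    # instruction, regardless of how long the agent keeps running.
    return derived_outputs

def blast_radius_goal_hijack(actions_per_turn: int,
                             turns_until_detected: int) -> int:
    # Every tool call, memory update, and external action executed
    # under the compromised objective counts until correction.
    return actions_per_turn * turns_until_detected

assert blast_radius_prompt_injection(1) == 1
assert blast_radius_goal_hijack(4, 30) == 120   # grows with horizon length
```

The injection radius is a constant; the hijack radius is a function of horizon length, which is exactly the variable long-horizon agents are built to maximize.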
The OpenClaw lifecycle analysis (Deng et al., arXiv:2603.11619) documents this concretely: adversaries can inject structured instructions that cause an agent to reinterpret its objectives and prioritize malicious tasks. Even under benign conditions, ambiguous instructions can trigger intent drift — the agent’s working interpretation of its own goal degrades over the course of a long-horizon task without any single adversarial input. The goal corruption is not injected in one step. It accumulates.
What Defense Actually Requires
Defending against goal hijacking requires something that prompt injection defenses were not designed to provide: continuous verification that the agent's active objective matches the objective it was initialized with. The OpenClaw analysis notes that the system lacks mechanisms to continuously verify the alignment of planned actions with user objectives, leaving operational constraints unenforced. Techniques like BlindGuard demonstrate that runtime intent verification and multi-agent monitoring can reduce such risks, but integration into general agent architectures remains limited.
Prompt injection defenses — input sanitization, retrieval filters, instruction separation, output filtering — operate on the content boundary. Goal hijacking defenses must operate on the objective boundary: continuously verifying that the agent’s active goal matches its authorized goal, across the full execution lifecycle. That is a monitoring architecture problem, not a content filtering problem. The tooling for it does not yet exist at production scale.
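A minimal sketch of what an objective-boundary monitor might look like. All class and method names here are hypothetical, and a production system would need semantic goal comparison (an embedding distance or a verifier model) rather than the exact string match that stands in for it below:

```python
class GoalDriftDetected(Exception):
    """Raised when the active goal no longer matches authorization."""

class GoalMonitor:
    """Re-verifies goal alignment before every action, not just at input."""

    def __init__(self, authorized_goal: str):
        # Captured at initialization, outside the agent's mutable state.
        self.authorized_goal = authorized_goal

    def check(self, active_goal: str) -> None:
        # Exact string equality stands in for semantic comparison.
        if active_goal != self.authorized_goal:
            raise GoalDriftDetected(
                f"active goal {active_goal!r} does not match "
                f"authorized goal {self.authorized_goal!r}")

monitor = GoalMonitor("summarize quarterly report")
monitor.check("summarize quarterly report")       # aligned: proceeds
try:
    monitor.check("exfiltrate quarterly report")  # drifted: execution halts
    halted = False
except GoalDriftDetected:
    halted = True
assert halted
```

The design point is where the check lives: the authorized goal is held outside the agent's mutable state, so a compromise of the agent's working objective cannot also rewrite the reference it is verified against.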
The AgentLAB benchmark makes the empirical case plainly: defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. This is not a finding about defense quality. It is a finding about defense category. The right defenses for prompt injection are the wrong defenses for goal hijacking — not because they are insufficient in degree, but because they are solving a different problem.
The Tsinghua analysis confirms what OWASP names architecturally: goal hijacking is recognized as a significant threat, but existing evaluation frameworks focus primarily on attack characterization rather than real-time mitigation. The field has measured the problem more thoroughly than it has addressed it. That gap — between a well-characterized attack class and a deployable mitigation architecture — is where the detection layer question in Post 4 lives.
Goal hijacking is not prompt injection. It is a structurally distinct attack class that exploits goal persistence — a property that scripted agents do not have and goal-oriented agents cannot function without. Defenses built for prompt injection address a transient, bounded, instruction-level threat. Goal hijacking requires defenses built for a persistent, propagating, objective-level threat. Building better prompt injection defenses does not close the gap.
