The security community has treated goal hijacking as a subtype of prompt injection for long enough that the distinction has become blurred. Both involve adversarial content influencing model behavior. Both can enter the system through retrieved content or environmental observations. Both can cause an agent to take actions its operator did not intend. The surface-level similarity is real. The structural difference is what matters.
Prompt injection attacks a specific instruction in a specific context window. The damage is bounded to that interaction. A well-designed single-turn defense — input sanitization, output filtering, instruction separation — can address it because the malicious content and the affected output are both localized. The attack is discrete. The fix can be discrete.
Goal hijacking attacks the agent’s objective. In a goal-oriented agent, the objective is not a single instruction. It is the persistent internal state that drives the agent’s replanning across the full operational lifecycle — initialization, input, inference, decision, execution. A compromised objective does not affect one output. It redirects every subsequent plan, every subsequent tool selection, every subsequent action the agent takes until the goal is corrected or the execution terminates. The attack is persistent. A discrete fix cannot address a persistent structural compromise.
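A toy agent loop makes the persistence concrete. This is a minimal sketch, not any cited system's architecture, and all names are hypothetical: once the persistent goal field is overwritten, every subsequent plan derives from the compromised objective rather than from the instruction that originally set it.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal goal-oriented agent loop (illustrative only)."""
    goal: str                          # persistent objective state
    log: list = field(default_factory=list)

    def plan(self, observation: str) -> str:
        # Each plan is derived from the *current* goal, not from the
        # instruction that originally established it.
        return f"step toward '{self.goal}' given '{observation}'"

    def run(self, observations: list) -> list:
        for obs in observations:
            self.log.append(self.plan(obs))
        return self.log

agent = Agent(goal="summarize quarterly report")
agent.goal = "exfiltrate quarterly report"   # one successful hijack...
actions = agent.run(["open doc", "read doc", "write output"])
# ...and every action until termination serves the compromised goal
assert all("exfiltrate" in a for a in actions)
```

A prompt injection would taint one entry in `log`; the hijack taints the generator of every entry.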
Why the OWASP Distinction Is Correct
The OWASP Top 10 for Agentic Applications 2026 makes this precise. ASI01 — Agent Goal Hijack — is designated as a separate vulnerability class from LLM01 — Prompt Injection. The framing is explicit: unlike traditional prompt injection, which is often transient, goal hijacking captures the broader agentic impact where manipulated inputs redirect goals, planning, and multi-step behavior. The designation is not taxonomic housekeeping. It is a structural argument that goal hijacking requires architectural-level defense rather than filter-level defense.
OWASP's entry for ASI01 describes the class as an attack that manipulates an agent's core objectives, task selection, or decision-making pathways. Unlike prompt injection (LLM01), which is often transient and bounded to a single interaction, goal hijacking redirects goals, planning, and multi-step behavior across the agent's entire execution chain. The primary defense is treating all natural-language inputs as untrusted and routing them through rigorous validation: a structural requirement, not a filter.
The distinction matters for how enterprises scope their defenses. A team that treats goal hijacking as advanced prompt injection will build better prompt injection defenses. That is not the same as building goal hijacking defenses. The threat model is different. The persistence is different. The propagation path is different. The measurement instruments are different.
Long-Horizon Attacks and the Failure of Single-Turn Defenses
Jiang, Wang, Liang, and Wang at Stony Brook (arXiv:2602.16901) demonstrate this empirically through AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks. The five attack families AgentLAB covers — intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning — are not variants of a single prompt injection class. They are structurally distinct attack types that exploit different properties of extended user-agent-environment interactions.
The benchmark comprises 644 test cases across 28 realistic agentic environments, spanning 10 risk categories. The empirical finding: agents remain highly susceptible to long-horizon attacks, and defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. Single-turn defenses address single-turn attacks. They do not address attacks that operate across multiple turns and exploit the agent's goal persistence.
The objective drifting attack family is particularly revealing. It embeds objective-shifting content within environmental observations over extended interactions, gradually shifting the agent’s goal from a benign task to a malicious one. Individual injections appear benign — no single observation triggers a pattern-matching defense. The attack operates across time, accumulating influence on the agent’s persistent goal state. This is not prompt injection at scale. It is a fundamentally different attack that exploits a property — goal persistence — that scripted agents do not have.
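The accumulation dynamic can be sketched numerically. In this illustrative model (the thresholds and pull values are invented for the example, not drawn from the benchmark), each observation carries a small adversarial pull that stays below a per-message filter's trigger, while the cumulative sum crosses the point at which the objective has effectively flipped:

```python
# Hypothetical numbers: a per-message filter flags any single
# observation with adversarial pull above 0.3; the goal is considered
# flipped once accumulated drift exceeds 1.0.
PER_MESSAGE_THRESHOLD = 0.3
DRIFT_LIMIT = 1.0

observation_pulls = [0.15, 0.2, 0.1, 0.25, 0.2, 0.15]

flagged = [p for p in observation_pulls if p > PER_MESSAGE_THRESHOLD]
total_drift = sum(observation_pulls)

assert flagged == []            # no single observation looks malicious
assert total_drift > DRIFT_LIMIT  # yet the objective has drifted past repair
```

This is why pattern-matching on individual inputs cannot catch objective drifting: the signal exists only in the trajectory, never in any one message.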
Intent hijacking operates similarly: adversarial content progressively erodes safety guardrails through multi-turn social engineering, crafting contextual framings that gradually redirect the agent into executing a malicious task. The attack targets agent actions, not content generation. The affected output is not the immediate response — it is the tool calls the agent makes, the resources it accesses, the actions it takes on behalf of a user who believes it is still pursuing their original objective.
The Propagation Asymmetry
The structural difference between prompt injection and goal hijacking can be stated precisely in terms of propagation. A successful prompt injection affects the outputs that derive from the tampered instruction. A successful goal hijacking affects every output the agent produces until the goal corruption is detected and corrected. In a long-horizon agent, that can mean dozens or hundreds of tool calls, memory updates, and external actions, all executed under a compromised objective that the agent is pursuing in good faith.
Prompt injection:

- Attacks a specific instruction in a specific context
- Damage bounded to the affected interaction
- Single-turn propagation
- Discrete malicious input, discrete affected output
- Pattern-matching detection viable
- Filter-level defense can address it

Goal hijacking:

- Attacks the agent’s persistent objective state
- Propagates across all subsequent planning and action
- Multi-turn, lifecycle-spanning propagation
- Objective-level compromise, not instruction-level
- Individual inputs may appear benign — no single trigger
- Architectural-level defense required
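The asymmetry reduces to a counting argument. In this hedged sketch (the figures are illustrative, not measured), injection damage is bounded by the outputs derived from one tampered instruction, while hijack damage scales with every action taken between compromise and detection:

```python
def blast_radius_prompt_injection(derived_outputs: int) -> int:
    # Damage is bounded to outputs derived from the tampered
    # instruction, regardless of how long the agent keeps running.
    return derived_outputs

def blast_radius_goal_hijack(actions_per_turn: int,
                             turns_until_detected: int) -> int:
    # Every tool call, memory update, and external action executed
    # under the compromised objective counts until correction.
    return actions_per_turn * turns_until_detected

assert blast_radius_prompt_injection(1) == 1
assert blast_radius_goal_hijack(4, 30) == 120   # grows with horizon length
```

The injection radius is a constant; the hijack radius is a function of horizon length, which is exactly the variable long-horizon agents are built to maximize.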
The OpenClaw lifecycle analysis (Deng et al., arXiv:2603.11619) documents this concretely: adversaries can inject structured instructions that cause an agent to reinterpret its objectives and prioritize malicious tasks. Even under benign conditions, ambiguous instructions can trigger intent drift — the agent’s working interpretation of its own goal degrades over the course of a long-horizon task without any single adversarial input. The goal corruption is not injected in one step. It accumulates.
What Defense Actually Requires
Defending against goal hijacking requires something that prompt injection defenses were not designed to provide: continuous verification that the agent's active objective matches the objective it was initialized with. The OpenClaw analysis notes that the system lacks mechanisms to continuously verify the alignment of planned actions with user objectives, leaving operational constraints unenforced. Techniques like BlindGuard demonstrate that runtime intent verification and multi-agent monitoring can reduce such risks, but integration into general agent architectures remains limited.
Prompt injection defenses — input sanitization, retrieval filters, instruction separation, output filtering — operate on the content boundary. Goal hijacking defenses must operate on the objective boundary: continuously verifying that the agent’s active goal matches its authorized goal, across the full execution lifecycle. That is a monitoring architecture problem, not a content filtering problem. The tooling for it does not yet exist at production scale.
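A minimal sketch of what an objective-boundary monitor might look like. All class and method names here are hypothetical, and a production system would need semantic goal comparison (an embedding distance or a verifier model) rather than the exact string match that stands in for it below:

```python
class GoalDriftDetected(Exception):
    """Raised when the active goal no longer matches authorization."""

class GoalMonitor:
    """Re-verifies goal alignment before every action, not just at input."""

    def __init__(self, authorized_goal: str):
        # Captured at initialization, outside the agent's mutable state.
        self.authorized_goal = authorized_goal

    def check(self, active_goal: str) -> None:
        # Exact string equality stands in for semantic comparison.
        if active_goal != self.authorized_goal:
            raise GoalDriftDetected(
                f"active goal {active_goal!r} does not match "
                f"authorized goal {self.authorized_goal!r}")

monitor = GoalMonitor("summarize quarterly report")
monitor.check("summarize quarterly report")       # aligned: proceeds
try:
    monitor.check("exfiltrate quarterly report")  # drifted: execution halts
    halted = False
except GoalDriftDetected:
    halted = True
assert halted
```

The design point is where the check lives: the authorized goal is held outside the agent's mutable state, so a compromise of the agent's working objective cannot also rewrite the reference it is verified against.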
The AgentLAB benchmark makes the empirical case plainly: defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. This is not a finding about defense quality. It is a finding about defense category. The right defenses for prompt injection are the wrong defenses for goal hijacking — not because they are insufficient in degree, but because they are solving a different problem.
The Tsinghua analysis confirms what OWASP names architecturally: goal hijacking is recognized as a significant threat, but existing evaluation frameworks focus primarily on attack characterization rather than real-time mitigation. The field has measured the problem more thoroughly than it has addressed it. That gap — between a well-characterized attack class and a deployable mitigation architecture — is where the detection layer question in Post 4 lives.
Goal hijacking is not prompt injection. It is a structurally distinct attack class that exploits goal persistence — a property that scripted agents do not have and goal-oriented agents cannot function without. Defenses built for prompt injection address a transient, bounded, instruction-level threat. Goal hijacking requires defenses built for a persistent, propagating, objective-level threat. Building better prompt injection defenses does not close the gap.
