Unstructured Data & RBAC - Luminity Digital, Inc.

RBAC operates on a model of discrete, identifiable, persistently-owned objects — where a subject’s role grants or denies access at a known boundary. Agentic AI systems violate every one of these assumptions simultaneously, creating a fundamental architectural incompatibility between established security frameworks and AI agent runtime behavior.

Agentic AI systems retrieve unstructured content dynamically via vector search (which has no concept of access boundaries), synthesize information across multiple source documents, generate new unstructured artifacts that derive from potentially-restricted inputs, and pass context between sub-agents with ambiguous permission inheritance. The result is that no access control decision can be made at a single, discrete boundary — yet that is precisely what traditional RBAC requires.

NIST SP 800-53 Rev 5 access control family requirements directly broken by standard agentic AI unstructured data pipelines — AC-3, AC-6, AC-16, AU-10, and SI-10.

NIST AI RMF (GOVERN 1.1, MAP 1.5) acknowledges that AI systems introduce new risk surfaces that existing control frameworks were not designed to address. The intersection of unstructured data retrieval and agentic execution is one of the most concrete manifestations of that gap.

Structural Divergence: Structured vs. Unstructured Data

The properties that make unstructured data valuable to agents are precisely the properties that make RBAC enforcement unreliable.

RBAC Works

Structured Data

Discrete Object Boundaries Database rows and files have clear ownership metadata. AC-3 (Access Enforcement) can operate deterministically.

Persistent Classification Labels Sensitivity labels persist with the object. RBAC policies can be checked at read time against a stable classification.

Atomic Access Decisions A SELECT on a restricted table either succeeds or fails. No partial exposure. No semantic leakage.

Auditable Access Log Every access event is a discrete, attributable transaction. AC-6 and AU-2 can enforce and verify.

No Transformation at Read Data is returned as stored. The consumer receives exactly what the RBAC check permitted — no more.

RBAC Fails

Unstructured Data + Agents

Semantic Boundaries Are Invisible to RBAC A PDF may contain mixed-sensitivity content. A vector chunk may span a classification boundary. There is no discrete object the policy engine can evaluate.

Classification Is Lost After Chunking RAG pipelines split documents into vector chunks that lose provenance metadata. The chunk’s inherited classification is typically not preserved in the embedding store.

Synthesis Creates New, Unclassified Artifacts Agent outputs are newly generated content derived from multiple sources. The resulting text has no RBAC lineage — it cannot be automatically classified or controlled.

Aggregation Violates Least Privilege Individual document reads may each be within policy, but an agent synthesizing a summary across all accessible documents violates the spirit of least privilege (NIST SP 800-53 AC-6).

Agent-in-the-Middle Breaks Audit Trails The agent is the authorized accessor, not the end user. Attribution of what data the user effectively received requires agent-level observability that most deployments lack.

OWASP LLM Top 10 (2025) — Directly Implicated

LLM01 : 2025

Prompt Injection

Malicious instructions embedded in unstructured documents retrieved by agents can override access controls and redirect agent behavior — a direct RBAC bypass via data plane manipulation.

LLM02 : 2025

Sensitive Information Disclosure

Agents synthesize and return content derived from restricted unstructured sources. Even when individual document access is permitted, the synthesized output may expose restricted information to unauthorized users.

LLM06 : 2025

Excessive Agency

Agents granted broad retrieval permissions operate far beyond what a user would be permitted to access directly. The agent’s permissions become a privilege escalation vector if not scoped to the requesting user’s effective rights.

LLM08 : 2025

Vector and Embedding Weaknesses

Vector stores lack native access control enforcement. Similarity search operates across all embeddings regardless of source classification — a fundamental architectural gap.

Six Critical RBAC Failure Modes

Each represents a scenario where a technically-compliant RBAC check fails to prevent unauthorized information exposure.

Failure Mode 01

Critical

RAG Retrieval Boundary Violation

Vector similarity search retrieves semantically-related chunks without evaluating source document access rights at query time. An embedding generated from a classified document and one from a public document may be semantically indistinguishable — both surface in retrieval results. If the vector store was populated without per-chunk ACL metadata, there is no enforcement point.

OWASP LLM08 NIST AC-3 NIST AI 100-2

Technical Root Cause Embedding models are trained to capture semantic similarity, not security classification. The vector space has no concept of a trust boundary. RBAC filtering must be applied as a post-retrieval metadata filter — but this only works if source ACL metadata was preserved during the chunking and ingestion pipeline.

Failure Mode 02

Critical

Indirect Prompt Injection via Retrieved Content

Malicious instructions embedded in unstructured content retrieved by the agent are interpreted as legitimate instructions. An attacker with write access to any document in the retrieval corpus can inject instructions that cause the agent to bypass access controls, exfiltrate data, or perform unauthorized actions on behalf of the user.

OWASP LLM01 NIST SI-10 NIST AI RMF MAP 2.2

Technical Root Cause LLMs do not natively distinguish between trusted system instructions and untrusted retrieved content. The model processes all tokens in the context window with equivalent weight unless architectural separations are enforced. NIST AI 100-2e2023 §3.1 classifies this as an adversarial example attack on the model’s input processing.

Failure Mode 03

High

The Information Aggregation Problem

Each individual document retrieval may be within the user’s permitted access scope. However, an agent instructed to synthesize a comprehensive summary across all accessible documents produces an output that effectively aggregates information the user could never have assembled under normal operational constraints.

OWASP LLM02 NIST AC-6 NIST SP 800-188

Technical Root Cause NIST SP 800-188 formally defines the aggregation problem: combining individually non-sensitive data elements can produce sensitive inferences. RBAC has no native mechanism to evaluate the cumulative sensitivity of an agent’s context window — only the sensitivity of each discrete retrieval event.

Failure Mode 04

High

Agent Delegation & Permission Inheritance Ambiguity

In multi-agent architectures (orchestrator → sub-agent delegation), there is no standardized mechanism for propagating user-level RBAC context through agent chains. Sub-agents typically operate under service account permissions rather than the end user’s effective rights, creating privilege escalation paths.

OWASP LLM06 NIST AC-2 RFC 6749 NIST SP 800-207

Technical Root Cause OAuth 2.0 token delegation (RFC 8693) and Zero Trust patterns (NIST SP 800-207) provide the theoretical framework for propagating user context — but agent frameworks have not standardized on these patterns. The A2A Protocol (Google, 2025) begins to address agent identity, but RBAC context propagation remains implementation-specific.

Failure Mode 05

High

Temporal Access Drift in Vector Stores

Access control on source documents is time-variant — users gain and lose permissions, documents are reclassified, employees are offboarded. Vector store embeddings are point-in-time snapshots. If a document’s access controls change after its embedding is ingested, the embedding remains retrievable, bypassing updated source document controls entirely.

NIST AC-2(j) NIST IA-4 OWASP LLM08

Technical Root Cause NIST AC-2(j) requires review of accounts for compliance with account management requirements. Extending this to vector store embeddings requires treating each embedded chunk as an access-controlled resource subject to lifecycle management — a concept absent from current vector database architectures.

Failure Mode 06

High

Generated Output Classification Void

Agent-generated responses derived from restricted source documents carry no inherent classification. A synthesis of ten restricted documents produces an output with no security metadata, no access control policy, and no lineage to the contributing sources. This artifact may be stored, shared, or forwarded without any RBAC enforcement downstream.

NIST AC-16 NIST AU-10 OWASP LLM02

Technical Root Cause Data lineage tracking in AI-generated content is an unsolved problem at scale. NIST AU-10 (Non-repudiation) requires that outputs be traceable to their originating actions, but no current vector database or LLM orchestration framework provides automated classification inheritance for generated outputs derived from multi-source retrievals.

RAG Pipeline: RBAC Enforcement Gaps by Stage

Access control failures are not concentrated at a single point — they occur at every stage of the unstructured data retrieval pipeline.

→ Data Flow & Access Control Gap Analysis

Stage 1

Document Ingestion

ACL metadata typically not captured during chunking. Source IAM policy is decoupled from embedding pipeline.

→

Stage 2

Chunking & Embedding

Chunks may span classification boundaries. Embedding captures semantics, not security posture. Parent document ACL is not inherited.

→

Stage 3

Vector Store Indexing

Embeddings stored without user-scoped ACL metadata unless explicitly engineered. No native RBAC enforcement in most vector DBs.

Stage 4

Similarity Retrieval

ANN search returns semantically similar chunks regardless of source classification. User context not evaluated at search time without post-retrieval filtering.

→

Stage 5

Context Assembly

Agent accumulates chunks from multiple sources. Aggregation problem emerges. No RBAC check on cumulative context sensitivity.

→

Stage 6

Generation & Output

Generated artifact has no classification lineage. Output may be stored, shared, or transmitted without any access control enforcement downstream.

NIST SP 800-53 Rev 5 — How Agents Break Each Control

Direct mapping of established access control requirements to agent-specific failure modes with unstructured data.

Control

Standard Intent

Agent Failure Mode

AC-3: Access Enforcement NIST SP 800-53 AC-3

Enforce approved authorizations at access control decision points.

No discrete decision point in vector similarity search. Retrieval bypasses enforcement.

AC-6: Least Privilege NIST SP 800-53 AC-6

Grant only minimum access required for legitimate purpose.

Agents typically operate with broad service account permissions. Aggregation violates least privilege even within permitted scopes.

AC-16: Security Attributes NIST SP 800-53 AC-16

Associate security attributes with information and systems.

Chunking destroys security attribute inheritance. Generated outputs have no attribute lineage.

AU-10: Non-Repudiation NIST SP 800-53 AU-10

Provide irrefutable evidence of access and information origin.

Agent operates as intermediary. User-level attribution requires agent-level audit logging not standard in current frameworks.

SI-10: Input Validation NIST SP 800-53 SI-10

Check validity of information inputs to systems.

Agents pass retrieved unstructured content directly to LLM context. No native validation distinguishes legitimate content from injected instructions.

Recommended Compensating Controls

Because standard RBAC cannot be directly applied to agent unstructured data pipelines, compensating controls must be engineered at each architectural layer.

🔏

Per-Chunk ACL Metadata Enforcement

Preserve source document ACL metadata at chunk ingestion time. Apply ACL-based post-retrieval filtering before chunks are assembled into agent context. Treat ACL drift as a first-class pipeline event requiring re-ingestion.

NIST AC-3 OWASP LLM08

🪪

User-Context Token Propagation

Implement OAuth 2.0 Token Exchange (RFC 8693) to propagate end-user identity and effective permissions through agent delegation chains. Agents should never access resources under service account identity when acting on behalf of a user.

RFC 8693 NIST SP 800-207 NIST AC-2

🧱

Content Isolation & Injection Defense

Architecturally separate system instructions from retrieved content in the agent context window. Apply input validation to retrieved documents before LLM processing. Use structured output schemas to constrain agent behavior.

OWASP LLM01 NIST SI-10

📊

Context Window Aggregation Limits

Implement maximum retrieval volume policies that prevent agents from assembling context windows that would effectively grant access to the entirety of a user’s permitted corpus. Enforce purpose-bound retrieval scoping.

NIST AC-6 SP 800-188

🏷️

Output Classification Inheritance

Automatically classify generated outputs at the highest sensitivity level of any contributing source document. Apply access controls to generated artifacts before storage or transmission. Log generation provenance for AU-10 compliance.

NIST AC-16 NIST AU-10 OWASP LLM02

🔭

Agent-Layer Observability & Audit

Instrument agent runtime to log every retrieval event with source document identity, user context, and effective ACL state. Implement real-time anomaly detection on retrieval patterns that suggest aggregation attacks or injection sequences.

NIST AU-2 AI RMF GOVERN 4.2

RBAC Maturity Model for Agentic Unstructured Data Systems

A four-level progression from baseline compliance to semantically-aware dynamic access control.

Level 1 — Baseline
Service Account Scoping Only
Agent operates under a service account with scoped permissions to specific repositories
No user-context propagation through the agent
Retrieval is unrestricted within the service account scope
Minimum viable for internal tooling with low-sensitivity corpora
Level 2 — Structured
Post-Retrieval ACL Filtering
ACL metadata preserved at chunk ingestion
Post-retrieval filtering applied before context assembly
User identity passed to retrieval layer for filtering
Audit logs capture retrieval events with user attribution
Level 3 — Advanced
Token Delegation + Output Classification
OAuth 2.0 token exchange for user context propagation through agent chains
Generated outputs classified at highest source sensitivity
Injection detection on retrieved content
Aggregation volume limits enforced per session
Level 4 — Adaptive
Semantic Access Control + Continuous Evaluation
Real-time semantic classification of retrieved chunks
Dynamic policy evaluation against cumulative context sensitivity
Behavioral anomaly detection on retrieval patterns
Continuous re-authorization throughout agent session lifecycle

Key Insight

The fundamental challenge is not model intelligence but system reliability. Standard RBAC was never designed for systems that dynamically retrieve, synthesize, and generate across trust boundaries. Compensating controls must be architected at every stage of the agent pipeline — from ingestion through generation — because no single enforcement point can address the full scope of RBAC breakdown in agentic systems.

Unstructured Data and the RBAC Breakdown in Agentic AI Systems

Structural Divergence: Structured vs. Unstructured Data

Structured Data

Unstructured Data + Agents

OWASP LLM Top 10 (2025) — Directly Implicated

Six Critical RBAC Failure Modes

RAG Retrieval Boundary Violation

Indirect Prompt Injection via Retrieved Content

The Information Aggregation Problem

Agent Delegation & Permission Inheritance Ambiguity

Temporal Access Drift in Vector Stores

Generated Output Classification Void

RAG Pipeline: RBAC Enforcement Gaps by Stage

NIST SP 800-53 Rev 5 — How Agents Break Each Control

Recommended Compensating Controls

Per-Chunk ACL Metadata Enforcement

User-Context Token Propagation

Content Isolation & Injection Defense

Context Window Aggregation Limits

Output Classification Inheritance

Agent-Layer Observability & Audit

RBAC Maturity Model for Agentic Unstructured Data Systems

Related Resources

Unstructured Data and the RBAC Breakdown in Agentic AI Systems

Structural Divergence: Structured vs. Unstructured Data

Structured Data

Unstructured Data + Agents

OWASP LLM Top 10 (2025) — Directly Implicated

Six Critical RBAC Failure Modes

RAG Retrieval Boundary Violation

Indirect Prompt Injection via Retrieved Content

The Information Aggregation Problem

Agent Delegation & Permission Inheritance Ambiguity

Temporal Access Drift in Vector Stores

Generated Output Classification Void

RAG Pipeline: RBAC Enforcement Gaps by Stage

NIST SP 800-53 Rev 5 — How Agents Break Each Control

Recommended Compensating Controls

Per-Chunk ACL Metadata Enforcement

User-Context Token Propagation

Content Isolation & Injection Defense

Context Window Aggregation Limits

Output Classification Inheritance

Agent-Layer Observability & Audit

RBAC Maturity Model for Agentic Unstructured Data Systems

Related Resources

Share this: