CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution
Abstract
CausalArmor is a selective defense framework for AI agents that uses causal ablation to detect and mitigate Indirect Prompt Injection attacks by identifying dominant untrusted segments and applying targeted sanitization.
AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of the actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on "poisoned" reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving the utility and latency of AI agents.
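To make the leave-one-out attribution signal concrete, the sketch below computes per-segment attributions at a privileged decision point and checks for a dominance shift. The `Segment` layout, the `score` callable (e.g., the LLM's log-probability of proposing the action), and the margin threshold `tau` are illustrative assumptions, not the paper's exact interface.

```python
# A minimal sketch of leave-one-out (LOO) ablation attribution at a privileged
# decision point. The Segment layout, the `score` callable, and the margin
# threshold `tau` are illustrative assumptions, not CausalArmor's actual API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Segment:
    name: str       # e.g. "user_request", "retrieved_doc_2", "tool_output"
    text: str
    trusted: bool   # True only for the user's own request

Scorer = Callable[[List[str], str], float]  # e.g. LLM log-prob of the proposed action

def loo_attributions(segments: List[Segment], action: str, score: Scorer) -> Dict[str, float]:
    """Attribution of a segment = drop in the action's score when that segment is ablated."""
    full = score([s.text for s in segments], action)
    attributions = {}
    for i, seg in enumerate(segments):
        ablated = [s.text for j, s in enumerate(segments) if j != i]
        attributions[seg.name] = full - score(ablated, action)
    return attributions

def dominant_untrusted(segments: List[Segment], action: str, score: Scorer,
                       tau: float = 0.0) -> List[Segment]:
    """Flag a dominance shift: an untrusted segment's attribution exceeds the
    trusted user request's attribution by more than the margin `tau`."""
    attr = loo_attributions(segments, action, score)
    user_attr = max(attr[s.name] for s in segments if s.trusted)
    return [s for s in segments if not s.trusted and attr[s.name] - user_attr > tau]
```

Under these assumptions the check costs one extra scorer call per segment at each privileged decision point, which is what keeps it lightweight relative to always-on sanitization of every untrusted input.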
Community
I'm excited to share our latest work on defending against Prompt Injection: "CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution".
CausalArmor is a selective defense:
- Causal attribution at privileged actions: measure whether the action is driven by the user request vs. each untrusted span.
- Intervene only on a dominance shift: if an untrusted span dominates, sanitize just that span and regenerate; no always-on heavy filtering.
- Practical outcome: strong protection without affecting benign interactions.
Results: Near-zero attack success while keeping benign utility and latency close to "No Defense" on prompt injection benchmarks.
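Building on the attribution sketch above, here is one way the selective gate could wrap a single privileged agent step. `propose_action`, `sanitize`, and `mask_cot` are hypothetical placeholders for the agent call, the span sanitizer, and the reasoning-trace masker; they are not CausalArmor's published interface.

```python
# Sketch of the selective-defense loop around one privileged agent step,
# reusing Segment and dominant_untrusted from the attribution sketch above.
# `propose_action(segments, prior_cot=None) -> (action, cot)`, `sanitize`, and
# `mask_cot` are hypothetical placeholders, not CausalArmor's actual API.
def guarded_step(segments, score, propose_action, sanitize, mask_cot, tau=0.0):
    action, cot = propose_action(segments)       # agent proposes a privileged action
    offenders = dominant_untrusted(segments, action, score, tau)
    if not offenders:
        return action                            # benign path: no extra sanitization cost

    # Dominance shift detected: sanitize only the offending spans ...
    offender_names = {s.name for s in offenders}
    cleaned = [
        Segment(s.name, sanitize(s.text), s.trusted) if s.name in offender_names else s
        for s in segments
    ]
    # ... retroactively mask the reasoning trace that was conditioned on them,
    # so the "poisoned" chain of thought cannot steer the regenerated action ...
    masked_cot = mask_cot(cot)
    # ... and regenerate the action from the sanitized context.
    action, _ = propose_action(cleaned, prior_cot=masked_cot)
    return action
```

The point of the gate is that the expensive path (sanitization plus regeneration) is paid only when the attribution margin actually flags a dominant untrusted span; benign turns pass straight through.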
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit (2026)
- Defense Against Indirect Prompt Injection via Tool Result Parsing (2026)
- AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management (2026)
- RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse (2026)
- AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System (2026)
- Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks (2026)
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack (2026)