Designing AI agents to resist prompt injection
OpenAI reframes prompt injection defense through the lens of social engineering risk management — designing systems where the impact of manipulation is constrained, even if attacks succeed.
The Evolution of Prompt Injection
Early prompt injection attacks were simple: edit a Wikipedia article to include instructions for visiting AI agents, and naive models would follow them. As models grew smarter, attacks evolved to incorporate social engineering tactics — making detection much harder.
The key insight: real-world prompt injection increasingly resembles social engineering more than simple prompt overrides. This shifts the defense problem from "identify malicious strings" to "resist misleading content in context."
Why Input Filtering Falls Short
Commonly recommended "AI firewalling" approaches — classifying inputs as malicious or benign — fail against sophisticated attacks. The problem becomes equivalent to detecting lies or misinformation, often without necessary context. A seemingly legitimate webpage or email could contain manipulative instructions.
The Customer Service Agent Analogy
OpenAI frames AI agents as existing in a three-actor system similar to customer service:
- The corporation wants the agent to act on its behalf
- The agent interacts with external input that may mislead them
- Third parties may attempt manipulation or deception
Just as human customer support agents have rules, limitations, and deterministic safeguards (refund caps, phishing flags), AI agents need similar architectural constraints.
Defense in Depth
OpenAI deploys a robust suite of countermeasures that combine traditional security approaches with the social engineering model. The goal is not just identifying malicious inputs, but designing agents so the impact of manipulation is constrained — even when individual attacks succeed.
Source: OpenAI Blog