Prompt injection attacks are typically described as adversarial inputs crafted to override or manipulate a language model’s behavior by exploiting its prompt - following nature. In many examples, these attacks rely on the presence of predefined system instructions (e.g., "You are a helpful assistant. Do not answer harmful questions.").
However, I’m curious whether a prompt injection attack can meaningfully exist if there are no explicit or implicit system instructions guiding the model’s behavior.
- Is some form of predefined instruction (explicit system prompt, behavioral constraint, or task framing) a necessary condition for defining a prompt injection attack?
- If not, how is “injection” conceptually different from regular prompt engineering?
- Are there any examples of prompt injection attacks occurring in zero-instruction (fully instruction-free) settings?
I’m looking for a principled explanation—especially from a security or alignment perspective - on whether prompt injection inherently presumes an instruction-following context.