0

Prompt injection attacks are typically described as adversarial inputs crafted to override or manipulate a language model’s behavior by exploiting its prompt - following nature. In many examples, these attacks rely on the presence of predefined system instructions (e.g., "You are a helpful assistant. Do not answer harmful questions.").

However, I’m curious whether a prompt injection attack can meaningfully exist if there are no explicit or implicit system instructions guiding the model’s behavior.

  1. Is some form of predefined instruction (explicit system prompt, behavioral constraint, or task framing) a necessary condition for defining a prompt injection attack?
  2. If not, how is “injection” conceptually different from regular prompt engineering?
  3. Are there any examples of prompt injection attacks occurring in zero-instruction (fully instruction-free) settings?

I’m looking for a principled explanation—especially from a security or alignment perspective - on whether prompt injection inherently presumes an instruction-following context.

hanugm
  • 4,102
  • 3
  • 29
  • 63

0 Answers0