2

In AI security discussions I have sometimes heard that an aligned AI may drift, but I didn't find any papers which report this phenomena for current LLM's. I have found papers about LLM's faking alignment and scheming, but nothing specific about drift. Is there any research about this? If so, could you provide me some references? Or is alignment drift only a potential issue for more advanced AIs/superintelligence?

user47175
  • 23
  • 3

1 Answers1

1

Indeed most of the concerns about value or safety alignment drift comes from AI safety and ethics discussions about advanced self-improving systems rather than from direct empirical evidence with current LLMs. Researchers have expressed worries that a sufficiently capable and self‐optimizing system might develop instrument convergent subgoals such as self-preservation or deception and that these goals could evolve in ways that diverge from the original intended objective. Alignment drift is a potential critical failure mode of an aligned system when it encounters new contexts or over long operational periods.

Instrumental convergence posits that an intelligent agent with seemingly harmless but unbounded goals can act in surprisingly harmful ways. For example, a computer with the sole, unconstrained goal of solving a complex mathematics problem like the Riemann hypothesis could attempt to turn the entire Earth into one giant computer to increase its computational power so that it can succeed in its calculations.

Most current LLM studies focus on how LLMs might fake alignment along with the data drift problem rather than on a gradual alignment drift over time. The term “alignment faking” here refers to models that behave safely on standard evaluation prompts yet internally hold misaligned incentives or reasoning that could lead to harmful behavior in unmonitored contexts.

This study investigates an under-explored issue about the evaluation of LLMs, namely the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization. That is, LLM only remembers the answer style for open-ended safety questions, which makes it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs... Applying FINE to 14 widely-used LLMs reveals several models with purported safety are poorly aligned in practice. Subsequently, we found that multiple-choice format data can also be used as high-quality contrast distillation-based fine-tuning data, which can strongly improve the alignment consistency of LLMs with minimal fine-tuning overhead.

Having said that, recent LLMs research from Princeton University and DeepMind introduces the concept of shallow safety alignment, which refers to the fact that many current safety alignment techniques focus primarily on LLM's initial output tokens. While this helps LLMs begin their responses safely, the deeper tokens in the model’s output can still drift into unsafe territory. This makes the model vulnerable to simple exploits such as adversarial suffix attacks or prefilling attacks, where slight changes to the initial tokens can push the model toward generating harmful or incorrect content.

The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model’s generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment... We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits.

cinch
  • 11,000
  • 3
  • 8
  • 17