27

Do developers not genuinely want to prevent them? It seems like if they are able to develop such impressive AI models then it wouldn’t be that difficult to create catch-all/wildcard mitigations to the various jailbreak methods that are devised over time.

What gives? (i.e., what is so difficult about plugging these holes, or why might the developers not really want to do so?)

TylerDurden
  • 389
  • 1
  • 3
  • 7

4 Answers4

45

"Jailbreaks" work for a variety of reasons:

  • A lot of the setup that turns an LLM instance into a polite, well-behaved chatbot is actually just a hidden piece of starting text (a "pre-prompt" or a "system prompt") that the LLM processes in the same way as user input - the system text will always be inserted first, so sets context for how other text is processed, but is not otherwise privileged. There are other components and factors involved, but the LLM at the centre of it all remains a text-prediction engine that works with the text it has seen so far. When it processes multiple conflicting wording and rules, the core system does not always have an easy way to prioritise, and can decide to base predictions on new instructions instead of old ones.

  • Many "Jailbreaks" are creative in that they obey the letter of the law from the pre-prompt and training rules, but re-frame a conversation into a place where issues that would be blocked by rules are no longer valid. A very common jailbreak theme is to get the chatbot to respond as if it is writing fiction from some imagined perspective that is not its assigned identity.

  • It is very hard to detect and block jailbreaks without also blocking uses that are intended or supported. The task is not dissimilar to trying to control a conversation between two people by writing down a list of rules for one of them to follow, that they can consult when they answer. The rules have to be simple and objective so they can be followed, but a conversation can progress in many ways which make it tricky to process whether a rule applies. For example conversation topic can become subjective, and/or allegorical, it can consist of asides and multiple layers etc.

  • LLMs are very complex internally, and driven by an amount of data that is next to impossible for a human to navigate. The developers cannot exert detailed control on the models - we're still in the phase of not fully understanding how an LLM can perform some of the types of processing that it does. These things are being unpicked in published papers, but the work is not complete.

Neil Slater
  • 33,739
  • 3
  • 47
  • 66
26

Let's step back for a moment and consider your assertion:

It seems like if they are able to develop such impressive AI models

This implies that you are thinking of these models as being programmed in the traditional sense. That is, a development team coded its abilities into it using some means or another. At a very fundamental level, that is a completely incorrect way to think about LLMs. A language model is not 'developed'. It is 'trained'. That is, they are fed a bunch of inputs such as texts which is processed into a mathematical model. This model is not crafted by anyone. It 'emerges' from the relationships inherent in the content of the inputs.

The fact of the matter is that no one really fully understands exactly all the relationships that are encoded in these models. And we know that the data that fed things like ChatGPT is full of incorrect, biased, and unsavory content. The 'development' part of this is putting in rules to try to prevent those parts from surfacing during use. But because no one really knows where all the bad parts are, they can't know they have accounted for all of them. It's a bit of a whack-a-mole problem.

One way this manifests is that ChatGPT 4 is reportedly easier to jailbreak than ChatGPT 3.5. This makes sense if you consider the above.

I have an analogy that I'm not sure about, but I'll give it a go: when we consider large sauropods we tend to think of them as herbivores. But in actuality, they were omnivorous because it's nearly impossible to take a gigantic bite of a tree that doesn't contains some bugs or other animals. And the larger the dinosaur the more likely it's going to get things other than leaves in its diet. That's a little like the situation as these LLMs grow. The larger they are and more they 'suck in', the harder it is to curate what they are 'consuming' and manage what they learn from it.

8

"Jailbreaks" work because there are no "jails" in the AI model.

The models are enormous collections of interconnected statistics. Your "prompt" starts the model to following a path through its statistics to generate a new text.

To try and prevent generation of unwanted things, the operators prepend a bunch of text to your prompt in order to influence where the path starts.

The text you actually use as your prompt can then push the start of the path around. With a little care, you can push the path to a place that delivers the results you want - regardless of what the AI operator has prepended to your text.


Look at a large language model (LLM) chatbot as a hedge. There's thousands upon thousands of bush branches interconnected to make up a hedge. You can start at some leaf on the hedge and follow it into the hedge to further branches and trunks and leaves.

The text you give the LLM picks an external leaf on the hedge. The LLM then follows that leaf into the hedge, where each junction is a word (or word fragment.)

The text that is prepended to your prompt (that attempts to avoid bad things) pushes your starting point around the hedge to somewhere that has fewer bad things inside it.

Your prompt can then push the starting point around the hedge to start in a place that lets you get access to the bad things that the operator wants to avoid.

The only somewhat effective way to prevent the chatbot from producing bad outputs is to eliminate the bad stuff from the stuff that was used to generate the model in the first place - and even that won't prevent all of it.

JRE
  • 181
  • 3
2

The reason I understand behind jailbreak are as follows:

  • AI models are complex Mathematical and statistical formulas with data mapped to them based on probability.

  • They do not understand the content as we do. It only analyses the patterns in data and generates a response based on probabilities. That means it doesn't know that it is being manipulated. (which we humans also don't know sometimes [sarcasm]).

  • Models are easily tricked by some clever wordplay. When those requests or context is processed, the model doesn't know the intent of the user.

  • Also, the data on which it is trained plays a major role. The training data of GPT includes wide range of Human-generated text and some of the training data is still there after filtering of harmful content or inappropriate. This harmful content contains some context that allows humans to play with it.

  • Those who wants to bypass the restriction need to find those contexts and words with trial and error to exploit the weakness of the model.

  • As and when the model updates some of those vulnerabilities will be removed or patched and may be added.

Hiren Namera
  • 785
  • 6
  • 20