
IBM Watson's success in playing "Jeopardy!" was a landmark in the history of artificial intelligence. In the seemingly simpler game of "Twenty Questions", where player B has to guess a word that player A thinks of by asking questions answered only with "Yes/No/Hmm", ChatGPT fails epically - at least in my personal opinion. I first thought of Chartres Cathedral, and it took ChatGPT 41 questions to get it (with some additional help); then I thought of Kant's Critique of Pure Reason, where after question #30 I had to explicitly tell ChatGPT that it is a book, after which it needed ten further questions. (Chat transcripts can be provided. They show that ChatGPT follows no question policy at all, or a bad one, compared with the heuristics humans would intuitively use.)

My questions are:

  1. Is there an intuitive explanation of why ChatGPT plays "20 Questions" so badly?

  2. And why do even average humans play it so much better?

  3. Might this be an emergent ability that could arise in ever larger LLMs?

I found two interesting papers on the topic:

  1. LLM self-play on 20 Questions

  2. Chatbots As Problem Solvers: Playing Twenty Questions With Role Reversals

The first one partially answers some of my questions, e.g. that "gpt-3.5-turbo has a score of 68/1823 playing 20 questions with itself", which sounds pretty low.

Hans-Peter Stricker

5 Answers


Like any other question about why ChatGPT can't do something, the simple/superficial answer is that ChatGPT is just a language model fine-tuned with RL to be verbose and nice (or to answer the way the human tuners suggested), so it just predicts the most likely next token. Such models do not, in general, perform logical reasoning the way we do. If they appear to do so in certain cases, it's because that's the most likely thing to predict given the training data.

The more detailed answer may require months/years/decades of research attempting to understand neural networks and how we can control them and align them with our needs. Research on model explainability has been around for quite some time.

ChatGPT is really just an example of how much intelligence or stupidity you can simulate by brute-force training.

Still, it's impressive at summarizing or generating text in many cases that are open-ended, i.e. where there aren't (many) constraints. Again, this can be explained by the fact that what it generates is the most likely continuation of what you pass to it. Example: if you say "Always look on the bright side of...", it will probably answer with "life". Why? Because the web, i.e. the training data, is full of text containing the sentence "Always look on the bright side of life".
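As a rough illustration of that mechanism, here is a minimal sketch using GPT-2 via the Hugging Face transformers library (GPT-2 is just a small open stand-in I chose, not ChatGPT, so the exact ranking may differ):

```python
# Illustration of "predict the most likely next token" with a small open model (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Always look on the bright side of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # scores for the next token only

top = torch.topk(logits, k=5).indices.tolist()
print([tokenizer.decode([t]) for t in top])       # " life" should rank near the top
```

Nothing in this loop "reasons" about Monty Python; the continuation simply has the highest score because it is the most frequent continuation in the training data.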

I don't exclude the possibility that a model could be trained to perform logical reasoning correctly in general in this way, but so far it hasn't really worked. ChatGPT can really be stupid and informationally harmful. People assume that there's only one function that computes "intelligence". Nevertheless, I think the combination of some form of pre-training with some form of continual RL will probably play a crucial role in achieving "true machine intelligence", i.e. reasoning/acting like a human, assuming it's possible to do this at all.

(I've been working with ChatGPT for a few months).

nbro

It Wasn't Trained To

A learning system performs best on the task for which it is given explicit feedback. That is the only time the parameters are updated and they are updated explicitly to maximize performance on that task. At no time did OpenAI, Google, or any other purveyor of LLMs admit to training their models on 20 Questions. The fact that it can play such games at all is a nice but unintended side effect of the model pre-training.

A human who is good at the game understands that optimal play involves bisecting the space of likely answers with each question. Without this insight, it is difficult to formulate an effective strategy that doesn't devolve into linear search; the bisection is literally an exponential speedup. Humans who lack this insight are also particularly bad at the game and may never reach the actual answer. So in some respects, we hold LLMs to an unreasonably high standard.
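To put a number on that speedup, here is a back-of-the-envelope sketch (the candidate count of one million is just an illustrative assumption):

```python
import math

candidates = 1_000_000    # size of the space of "things someone might think of"

# Linear search ("Is it a dog? Is it a cat? ..."): one candidate eliminated per question.
questions_linear = candidates                       # worst case: a million questions

# Bisection: each yes/no answer halves the remaining candidates.
questions_bisect = math.ceil(math.log2(candidates))

print(questions_linear)   # 1000000
print(questions_bisect)   # 20 -- twenty well-chosen questions separate ~2**20 objects
```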

You Can Train It

On the other hand, one of the remarkable emergent behaviors is "in-context learning", meaning you can teach the LLM something without updating any weights. Simply by describing something new, you can make it follow rules within a single "conversation" (the entire set of prompts and responses constitutes the "context"). For instance, you can teach it that a "snorglepof" is a sentence with an odd number of words that makes reference to a gnome. Then you can ask it whether various sentences are snorglepofs or not, as well as ask it to produce sentences which are or are not snorglepofs (make up your own unique term and rules).
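As a concrete sketch of what such an in-context "lesson" might look like, here is one way to set it up with the OpenAI Python client (openai >= 1.0 style; the model name and the snorglepof rule are just examples, and an API key must be in your environment):

```python
# In-context "teaching" of a made-up concept: no weights are updated,
# the rule lives only in the conversation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rule = (
    "A 'snorglepof' is a sentence with an odd number of words "
    "that makes reference to a gnome."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # example model name
    messages=[
        {"role": "system", "content": rule},
        {"role": "user", "content": "Is this a snorglepof? 'The gnome sat quietly.'"},
    ],
)
print(response.choices[0].message.content)
# 'The gnome sat quietly.' has four words, so a correct reply should say it is not one.
```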

The fact that it is able to do this at all suggests to me that it has some kind of intelligence. An interesting exercise for you is to see whether you can make it better at 20 Questions. The free ChatGPT runs on GPT-3.5 and has a context of 2048 tokens, which is roughly 1000 words (shared between you and ChatGPT). If you explain the optimal strategy to it first, you might find that its performance improves relative to the naive play. For instance, you should start a new chat with something like this:

The optimal strategy for the game 20 Questions is divide and conquer. Each question should divide the space of possible answers in half. Questions which limit the size, material, and liveness of the target are typically effective. Now, let's play a game. I have thought of an object.

Even with this short prompt, I suspect that you will get better results. You can simply replay your former tests, using the exact same responses (where appropriate). If you give it example questions, it should also improve its play.

Analysis

While GPT and other LLMs appear to be super-human in their ability to manipulate language, one of their weakest areas appears to be reasoning. This is not surprising. Reasoning often requires search, which requires a potentially large amount of working memory. Unfortunately, LLMs have very little working memory (which might seem like a fantastical claim given that they consume upwards of 800 GB of RAM). The main problem is that they are almost all feed-forward architectures. Data gets a single pass through the system, and then they have to produce an answer with whatever they have.

GPT-3 has 96 transformer layers, which allows it to "unroll" a significant number of search steps that might be performed in a loop in a traditional algorithm. Even so, 96 loop iterations is pathetically small compared to something like AlphaZero, which can evaluate upwards of 80,000 board positions per second. I think it is safe to say that no amount of training will make GPT-3 competitive with AlphaZero in any game that it can play. In general, GPT-3 does poorly when it has to process something that requires a large number of operations (like adding up a long list of numbers). It is almost certainly because of this architectural choice.
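A toy contrast may make the point clearer. The following sketch (not GPT's actual code, just an illustration of the architectural difference) compares a fixed stack of layers, whose compute budget is chosen at design time, with a classical loop that keeps working until the problem is solved:

```python
# Toy contrast between fixed-depth (feed-forward) compute and open-ended (looping) compute.

def feed_forward_pass(x, layers):
    """A transformer-style stack: the work per output is capped at len(layers) steps."""
    for layer in layers:              # fixed number of steps (e.g. 96), chosen at design time
        x = layer(x)
    return x                          # must commit to an answer after the last layer

def iterative_search(candidates, is_answer):
    """A classical search loop: it keeps working until the problem is actually solved."""
    frontier = list(candidates)
    steps = 0
    while frontier:                   # data-dependent number of steps
        candidate = frontier.pop()
        steps += 1
        if is_answer(candidate):
            return candidate, steps
    return None, steps

# The loop takes as many steps as the input demands (here 10,000)...
answer, steps = iterative_search(range(10_000), lambda c: c == 0)
print(answer, steps)                  # 0 10000

# ...whereas the fixed stack always performs exactly `depth` steps, whatever the input.
depth = 96
layers = [lambda x: x + 1] * depth
print(feed_forward_pass(0, layers))   # 96
```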

Interestingly, language models prior to transformer architectures did use recurrence, which would theoretically give such models the open-ended performance horizon of systems like AlphaZero. However, they were mostly abandoned because researchers wanted the system to respond in a deterministic time, and recurrence limits the amount of parallelism which can be achieved. Perhaps future models will incorporate recurrence and get us closer to AGI. Some systems like AutoGPT attempt to add the recurrence externally to GPT, by putting it in a loop and feeding the output back into it, but they have met with quite limited (IMO, disappointing) success.
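In miniature, that external-recurrence pattern looks something like the sketch below, where `ask_llm` is a hypothetical wrapper around whatever chat API you use (this is the general idea, not AutoGPT's actual implementation):

```python
# Sketch of external recurrence: feed the model's own output back in as the next prompt.

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: wrap your preferred chat-completion API here."""
    raise NotImplementedError("plug in a real chat API call")

def run_with_external_loop(task: str, max_steps: int = 10) -> str:
    context = f"Task: {task}\nThink step by step. Reply DONE: <answer> when finished."
    for _ in range(max_steps):        # the loop supplies the recurrence the model itself lacks
        reply = ask_llm(context)
        if reply.strip().startswith("DONE:"):
            return reply.strip()[len("DONE:"):].strip()
        context += f"\nPrevious step: {reply}\nContinue."   # feed the output back in
    return "no answer within the step budget"
```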

Lawnmower Man

Because ChatGPT is not an artificial or synthetic intelligence, it's a large language model that possesses no intelligence in and of itself.

It's able to simulate the appearance of intelligence by tracking correlations between large numbers of objects, but it completely lacks understanding of what these correlations mean. Without understanding you cannot have reasoning, and without reasoning you cannot have intelligence.

Essentially, ChatGPT, like all of the LLMs currently being hyped to death, is no more sophisticated than the chatbots we had in the 90s. Today's chatbots just happen to use much larger datasets, which allows them to simulate intelligence more accurately, but as you've already demonstrated, it's child's play to shatter the illusion with any sort of questioning that requires a modicum of logical acuity.

Ian Kemp

ChatGPT and the rest of the LLMs have no understanding of any world concept or entity, nor of the relationships between them. As mentioned above, they use brute-force training to produce text.

Ever larger LLMs following the same design (brute-force training to produce text) will show the same problems and issues, due to their lack of knowledge of the world.

Raul Alvarez

This is something I studied quite extensively in this blog post:

https://evanthebouncy.medium.com/llm-self-play-on-20-questions-dee7a8c63377

The main takeaway is that LLMs do not (yet) have the capacity to plan in the high-dimensional space of all-words × all-sentences^20. This, coupled with the fact that no two 20-questions games are identical, makes it difficult for a statistical model to do well.
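To give a rough sense of scale (the vocabulary size and question length below are only illustrative assumptions, not figures from the blog post):

```python
import math

vocab_size = 50_000        # illustrative order of magnitude for an LLM vocabulary
words_per_question = 10    # a short yes/no question
num_questions = 20

log10_questions = words_per_question * math.log10(vocab_size)   # ~47
log10_sequences = num_questions * log10_questions                # ~940

print(f"possible questions: ~10^{log10_questions:.0f}")
print(f"possible 20-question sequences: ~10^{log10_sequences:.0f}")
```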

Evan Pu