
I ran into this AI-SE question from five years ago, and I believe an updated version would be interesting to discuss today: Is the smartest robot more clever than the stupidest human?

Today's best LLMs display many human-like abilities: proficiency in natural language, coding, logical reasoning, role playing, and so on. They can even solve CAPTCHAs, design games, answer questions about stories, or write new ones: these were the "shortcomings of robots" in 2018, according to the answers to the question I linked.

Question

How do the best LLMs of today compare to a "dumb human"? In what tasks are all normal humans still better than AIs? Is there any test that every able-bodied human would pass, but top LLMs would still fail?

Definitions and clarifications

A "dumb human" is a person without recognized disabilities or obvious problems, who doesn't have particular skills and who is considered not very intelligent (low IQ).

Of course, the LLMs available to the public have a number of objective limitations: they can only process text to text, they work with tokens rather than characters, context length is just a few thousand tokens, and they have no long-term memory. However, a number of open-source projects have demonstrated workarounds for these problems, and the non-public versions of the commercial LLMs already support much larger context windows, image input, and similar features. Observations like "LLMs can't move arms because they don't have any", "LLMs fail to count characters because they're token-based", and "LLMs can't speak or listen to speech" are not interesting.
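As an illustration of the token point (a minimal sketch assuming OpenAI's tiktoken library; the sample word is arbitrary):

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 family.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("strawberry")
print(tokens)                             # a handful of token IDs
print([enc.decode([t]) for t in tokens])  # the sub-word pieces the model sees

# The model never sees the 10 individual characters, which is why
# character-counting questions probe the tokenizer, not intelligence.
```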

Blue Nebula
  • Looks like a duplicate of https://ai.stackexchange.com/questions/7021/is-the-smartest-robot-more-clever-than-the-stupidest-human?noredirect=1&lq=1. If it isn't, please explain why not. – Bruce Adams Jun 15 '23 at 18:40
  • @BruceAdams: that question is about general capacities of robots in 2018 and the answers say those robots aren't able to solve CAPTCHAs and write stories. This one is about text-based tasks of LLMs in 2023. – Blue Nebula Jun 15 '23 at 19:49

2 Answers


LLMs seem to struggle with "compositional tasks". Have a look at this paper, in which the authors

investigate the limits of these models across three representative compositional tasks—multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.

I don't know if a "dumb human" can do dynamic programming problems, but

humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively.

I gave ChatGPT two tries ("what is 311 times 877" and "what is 513 times 799"), and it got both wrong.
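For reference, the correct products are 272,747 and 409,887. The "sub-steps" the paper refers to are easy to spell out in code; here is a minimal sketch of school-method long multiplication (the helper name is mine):

```python
# School-method long multiplication: one partial product per digit of b,
# shifted by that digit's place value, then summed.
def long_multiply(a: int, b: int) -> int:
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        total += a * int(digit) * 10 ** place
    return total

assert long_multiply(311, 877) == 272747
assert long_multiply(513, 799) == 409887
```

Each partial product, and the final sum, is exactly the kind of intermediate step the paper argues these models fail to compose reliably.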

Another example from that paper is a "zebra" (or "Einstein") puzzle; see page 18 of the paper, which I copy below. It took me a couple of minutes to solve. ChatGPT tries its best but gets it quite wrong; for comparison, a brute-force solver is sketched after its output below.

The puzzle:

There are 3 houses (numbered 1 on the left, 3 on the right). Each has a different person in them. They have different characteristics:

  • Each person has a unique name: peter, eric, arnold
  • People have different favorite sports: soccer, tennis, basketball
  • People own different car models: tesla, ford, camry

clues:

  • The person who owns a Ford is the person who loves tennis.
  • Arnold is in the third house.
  • The person who owns a Camry is directly left of the person who owns a Ford.
  • Eric is the person who owns a Camry.
  • The person who loves basketball is Eric.
  • The person who loves tennis and the person who loves soccer are next to each other.

Match the person, favorite sports, and car models to each house.

ChatGPT's output (for me, just now):


<omitting the output where it "thinks">

Therefore, the matching of people, favorite sports, and car models to each house is as follows:

House 1:

Person: Peter
Favorite sport: Tennis
Car model: Tesla

House 2:

Person: Eric
Favorite sport: Basketball
Car model: Ford

House 3:

Person: Arnold
Favorite sport: Soccer
Car model: Camry

and it's easy to see that some of the constraints are violated (e.g. the first clue: the Ford owner should love tennis, but here the Ford owner loves basketball).
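For comparison, the search space is only 3! × 3! × 3! = 216 assignments, so a brute-force check solves the puzzle instantly. A minimal sketch in Python (houses are indexed 0–2 from the left; all labels are lowercase as in the clues):

```python
from itertools import permutations

names = ("peter", "eric", "arnold")
sports = ("soccer", "tennis", "basketball")
cars = ("tesla", "ford", "camry")

# Try every assignment of names, sports, and cars to houses 0..2.
for name in permutations(names):
    for sport in permutations(sports):
        for car in permutations(cars):
            ok = (
                sport[car.index("ford")] == "tennis"              # clue 1
                and name[2] == "arnold"                           # clue 2
                and car.index("camry") + 1 == car.index("ford")   # clue 3
                and name[car.index("camry")] == "eric"            # clue 4
                and sport[name.index("eric")] == "basketball"     # clue 5
                and abs(sport.index("tennis")
                        - sport.index("soccer")) == 1             # clue 6
            )
            if ok:
                for h in range(3):
                    print(f"House {h + 1}: {name[h]}, {sport[h]}, {car[h]}")
```

The unique solution it prints (Eric/basketball/Camry in house 1, Peter/tennis/Ford in house 2, Arnold/soccer/Tesla in house 3) satisfies every clue, unlike ChatGPT's answer.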

bogovicj

Humans (any human) are still better at one of the most important tasks, whether text-based or not: giving you their own opinion or "sentiment" about a text. Any human can tell you whether he or she likes a text. An LLM doesn't have any "personality", "opinion" or "feelings" of its own, so it cannot give you its own opinion, only a "general" sentiment (what NLP calls sentiment analysis) derived from the training data used to build the model.
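For context, this is what sentiment analysis means as an NLP task; a minimal sketch assuming the Hugging Face transformers library (it loads the library's default English sentiment model):

```python
from transformers import pipeline

# The "opinion" this returns is an aggregate learned from training data,
# not a personal preference.
classifier = pipeline("sentiment-analysis")

print(classifier("I really enjoyed reading this story."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```

The model outputs a label and a confidence score; there is no "I" behind it that actually liked or disliked the story, which is the distinction this answer draws.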

Raul Alvarez