Recent advances in large language models (LLMs) such as GPT-3 and GPT-4 have led to claims that these models can perform human-like reasoning and abstraction. However, new research indicates that significant gaps remain between LLM and human reasoning abilities.
In a new study, researchers from the Santa Fe Institute evaluated text-only and multimodal versions of GPT-4 on a benchmark called ConceptARC, which tests understanding and reasoning over basic concepts like above/below, inside/outside, and same/different. ConceptARC tasks follow the format of the Abstraction and Reasoning Corpus (ARC): each task presents a few input-output grid pairs demonstrating a transformation, plus a test input whose output the solver must produce.
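To make the task format concrete, the sketch below shows how an ARC-style task is typically encoded: grids of color-coded integers, split into demonstration pairs and a held-out test input. The specific task is a hypothetical illustration of an above/below concept invented for this example, not one drawn from the benchmark itself.

```python
# Hypothetical ARC-style task illustrating an "above/below" concept.
# Grids are 2-D lists of ints 0-9 (0 = background, other values = colors),
# following the JSON layout used by ARC and ConceptARC.
task = {
    "train": [  # demonstration pairs: keep only the rows above the gray (5) line
        {
            "input":  [[2, 0, 0],
                       [5, 5, 5],
                       [0, 3, 0]],
            "output": [[2, 0, 0]],
        },
    ],
    "test": [  # the solver must produce the output grid for this input
        {
            "input":  [[0, 0, 4],
                       [5, 5, 5],
                       [6, 0, 0]],
            "output": [[0, 0, 4]],  # withheld from the model at test time
        },
    ],
}
```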
The researchers first tested a text-only version of GPT-4, using more detailed prompting than in previous work, including one solved example task in each prompt. This improved performance over simpler prompting, but GPT-4's accuracy remained around 33%, compared to 91% for humans, on the 480 ConceptARC tasks.
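The paper's exact prompt wording is not reproduced here, but a one-shot text prompt for such a task might be assembled along the following lines. The grid serialization, instruction text, and function names are illustrative assumptions, not the authors' implementation.

```python
def grid_to_text(grid):
    """Serialize a 2-D grid of ints as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_one_shot_prompt(example_task, target_task):
    """Assemble a one-shot prompt: a fully solved example task, followed by
    the target task's demonstration pairs and its unsolved test input."""
    lines = ["Infer the rule mapping each input grid to its output grid, "
             "then apply it to the final test input.", "", "Example task:"]
    for pair in example_task["train"] + example_task["test"]:
        lines += ["Input:", grid_to_text(pair["input"]),
                  "Output:", grid_to_text(pair["output"]), ""]
    lines.append("Now solve this task:")
    for pair in target_task["train"]:
        lines += ["Input:", grid_to_text(pair["input"]),
                  "Output:", grid_to_text(pair["output"]), ""]
    lines += ["Test input:", grid_to_text(target_task["test"][0]["input"]),
              "Test output:"]
    return "\n".join(lines)
```

The model's completion is then parsed back into a grid and checked cell-for-cell against the withheld test output.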
To compare GPT-4 with humans on visual input more directly, the researchers also tested GPT-4V, the multimodal version, on simplified “minimal” ConceptARC tasks presented as images. Surprisingly, GPT-4V performed substantially worse than the text-only model, achieving only 23-25% accuracy, compared to 65-69% for text-only GPT-4 on the same tasks.
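Testing the multimodal model requires rasterizing each grid into an image. One plausible way to do this is sketched below, assuming a matplotlib rendering with a fixed 10-color palette; the actual colors and rendering pipeline used in the study are not specified here and are an assumption.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# One color per grid value 0-9; the specific hues are an assumption,
# not the palette used by the study's authors.
PALETTE = ListedColormap([
    "black", "blue", "red", "green", "yellow",
    "gray", "magenta", "orange", "cyan", "brown",
])

def render_grid(grid, path):
    """Rasterize a 2-D grid of ints 0-9 to an image, one cell per square."""
    fig, ax = plt.subplots()
    ax.imshow(grid, cmap=PALETTE, vmin=0, vmax=9)
    ax.set_xticks([])
    ax.set_yticks([])
    fig.savefig(path, bbox_inches="tight", dpi=150)
    plt.close(fig)
```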
The results reinforce that despite recent advances, leading LLMs still lack the robust abstraction abilities and flexible reasoning humans demonstrate even for basic concepts. The authors conclude that better prompting strategies could improve LLM performance, but fundamental gaps likely remain between human and artificial intelligence.
The ConceptARC benchmark provides a useful methodology for continued assessment of reasoning and abstraction capabilities as LLMs evolve. More work is needed to understand the differences between human and artificial reasoning mechanisms and to move towards more human-like learning and generalization.