The Truth Behind Claims of GPT-4’s Declining Performance
A new paper analyzing different versions of GPT-3.5 and GPT-4 has sparked intense debate over whether GPT-4’s capabilities have degraded over time. However, as AI experts Arvind Narayanan and Sayash Kapoor explain, the paper’s findings have been widely misinterpreted.
The key distinction is between a model’s capabilities and its behavior. A model’s capabilities refer to what it has the potential to do, as determined during the expensive pre-training process. Its behavior refers to what it actually does in response to prompts, as shaped by ongoing fine-tuning. Critically, fine-tuning is much cheaper than pre-training. It tweaks a model’s behavior without changing its core capabilities.
With this in mind, the paper’s results are not evidence of declining capabilities, but rather changing behavior. The authors tested GPT-3.5 and GPT-4 from March and June on math, sensitive questions, code generation, and visual reasoning. They found degraded performance on math and code generation.
However, upon closer examination, the math evaluation had a major flaw: it tested only prime numbers, never composites, so a model that always answers "prime" scores perfectly regardless of any real ability. In reality, all the models performed equally poorly, simply guessing "prime" or "composite" depending on how they were fine-tuned. The code generation metric had a similar problem: it measured only whether the output was directly executable, not whether the code was correct.
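The gap between "executable" and "correct" is easy to see concretely. The sketch below (my own illustration, not code from the paper) checks a hypothetical generated snippet both ways; a correct answer wrapped in markdown fences fails the executability check, while the underlying code would pass a real unit test once unwrapped.

```python
# Sketch: "directly executable" is a weaker, and different, check than
# "correct". The generated snippets below are hypothetical examples,
# not actual outputs from either model.

def is_executable(code: str) -> bool:
    """Return True if the string compiles and runs without raising."""
    try:
        exec(compile(code, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

def is_correct(code: str) -> bool:
    """Return True if the code defines an add() that passes a unit test."""
    namespace = {}
    try:
        exec(compile(code, "<generated>", "exec"), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False

plain = "def add(a, b):\n    return a + b"
fenced = "```python\n" + plain + "\n```"  # markdown wrapping breaks exec()

# The fenced answer fails the executability check even though the code
# inside it is perfectly correct; an evaluation that stripped the fences
# before testing would score it as right.
```

An executability metric penalizes a model that starts wrapping answers in markdown (a pure behavior change) exactly as harshly as one that starts writing broken code.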
When Narayanan and Kapoor tested the models on composite numbers, they found no meaningful performance differences. The appearance of decline stemmed entirely from the paper’s limited methodology.
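Why a prime-only test set cannot measure anything can be shown in a few lines. The sketch below (my construction, not the paper's or Narayanan and Kapoor's code) scores a stand-in "model" that blindly answers "prime" on every input: it looks flawless on a prime-only benchmark and collapses to chance on a balanced one.

```python
# Sketch: a benchmark containing only primes cannot distinguish real
# primality-testing ability from a fixed guess.

def is_prime(n: int) -> bool:
    """Trial-division primality check, used only to build the test sets."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def always_prime(n: int) -> str:
    """Stand-in for a model fine-tuned into guessing 'prime' every time."""
    return "prime"

def accuracy(model, numbers) -> float:
    """Fraction of inputs the model labels correctly."""
    correct = sum(
        model(n) == ("prime" if is_prime(n) else "composite")
        for n in numbers
    )
    return correct / len(numbers)

primes_only = [n for n in range(2, 1000) if is_prime(n)]
composites = [n for n in range(2, 1000) if not is_prime(n)]
balanced = primes_only + composites[: len(primes_only)]

# On the prime-only benchmark the blind guesser scores 100%;
# on the balanced benchmark it scores exactly 50%.
```

The same logic explains the paper's result in reverse: a June model fine-tuned toward guessing "composite" scores near zero on a prime-only set, which reads as a capability collapse but is just the flipped guess.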
Some speculate that GPT-4 was intentionally degraded to cut costs, but there is no evidence for this. The findings are fully consistent with routine behavioral drift from ongoing fine-tuning.
Nonetheless, the paper serves as an important reminder that fine-tuning can significantly impact behaviors on certain tasks. This poses challenges for developers aiming to build reliable products using APIs like GPT-3.5 and GPT-4. Seemingly minor tweaks can quietly break workflows optimized for specific applications.
In summary, the paper, while interesting, does not show degraded capabilities in GPT-4. It does, however, highlight the difficulty of quantitatively evaluating large language models, and the precariousness of building products on foundations that shift beneath them. We must maintain rigorous skepticism when assessing claims of AI progress or deterioration over time.