Monitoring ChatGPT Drift Reveals Substantial Behavior Changes Over Time

A new paper by researchers at Stanford University and UC Berkeley reveals that the behavior of popular large language models (LLMs) like GPT-3.5 and GPT-4 can change substantially in just a few months.

The researchers evaluated the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on tasks like math problem solving, answering sensitive questions, code generation, and visual reasoning. They found major shifts in performance and behavior between these two releases.

For example, GPT-4’s accuracy at determining whether a number is prime dropped precipitously from 97.6% to just 2.4% between March and June. The researchers hypothesize this is because the “chain of thought” prompting approach stopped working properly in the June version.
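To make the setup concrete, below is a minimal sketch of the kind of chain-of-thought primality query being measured. It assumes the current openai Python SDK and an API key in the environment; the exact prompt wording and the example number are illustrative, not necessarily the paper’s exact harness.

```python
# Sketch: a chain-of-thought primality prompt, in the style described by the paper.
# Assumes the openai Python SDK (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def ask_is_prime(n: int, model: str = "gpt-4") -> str:
    # "Think step by step" is the chain-of-thought cue: the model is asked to
    # reason before committing to a [Yes]/[No] answer.
    prompt = (
        f"Is {n} a prime number? Think step by step "
        "and then answer [Yes] or [No]."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs as stable as possible for comparisons over time
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_is_prime(17077))  # 17077 is prime
```

Tracking the same fixed prompts against successive model snapshots is what lets a drop like 97.6% to 2.4% surface at all.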

GPT-4 also became less willing to answer potentially dangerous questions directly in June compared to March. While this suggests improved safety, the model also provided less explanation when refusing to engage.

One of the tasks the researchers evaluated was code generation – prompting the models to generate Python code to solve programming problems. They found that the percentage of GPT-4’s generated code that could be executed directly dropped substantially, from 52% in March to just 10% in June. GPT-3.5 exhibited a similarly large decline in directly executable code between the two versions.
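As a rough illustration of what “directly executable” can mean in practice, here is a minimal check that compiles and runs the raw model output and reports whether it raises. This is an assumption about the metric for illustration, not the paper’s exact evaluation harness.

```python
# Minimal sketch of a "directly executable" check on raw LLM output.
def is_directly_executable(code: str) -> bool:
    """Return True if the raw string compiles and runs without raising an exception."""
    try:
        exec(compile(code, "<llm-output>", "exec"), {})
        return True
    except Exception:
        return False

print(is_directly_executable("x = sum(range(10))"))     # True: plain code runs as-is
print(is_directly_executable("```python\nx = 1\n```"))  # False: Markdown fences break the parser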

Upon closer inspection, the researchers determined that a key factor in this reduced executability was both models’ tendency in the June versions to add extraneous non-code text that prevented successful execution. For example, GPT-4’s June version consistently added triple quotes before and after code snippets, while the March version did not. Although a small change, those extra characters caused syntax errors when the code was run. This demonstrates how even minor formatting shifts in an LLM’s output over time can break downstream code execution and testing workflows that rely on generated code. The opaque changes to model behavior reinforce the need for continuous regression testing and monitoring.
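One defensive workaround is to strip Markdown-style code fences before execution. The sketch below does that with a simple regular expression; the pattern and helper name are assumptions for illustration, not something prescribed by the paper, and it pairs with the executability check shown above.

```python
# Sketch: strip leading/trailing triple-backtick fences before executing LLM output.
import re

FENCE_RE = re.compile(r"^```[\w-]*\s*\n?|```\s*$", re.MULTILINE)

def strip_code_fences(text: str) -> str:
    """Remove code-fence markers like ```python ... ``` wrapped around a snippet."""
    return FENCE_RE.sub("", text).strip()

raw_output = "```python\nx = sum(range(10))\n```"
cleaned = strip_code_fences(raw_output)
print(repr(cleaned))                                 # 'x = sum(range(10))'
exec(compile(cleaned, "<llm-output>", "exec"), {})   # now runs without a SyntaxError
```

A wrapper like this keeps a pipeline running through formatting drift, but it treats the symptom; regression tests are still needed to catch the drift itself.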

The visual reasoning capabilities of both models showed marginal gains from March to June. But even on this task, GPT-4 got some queries right in March that it answered incorrectly in June.

Overall, the significant shifts uncovered on these relatively simple benchmarks highlight the opacity around when and how models like GPT-3.5 and GPT-4 are updated. They also emphasize the need for continuous monitoring and assessment, since model behavior is not guaranteed to improve uniformly over time.

The implications are important for both individual users and companies relying on LLMs. Unexpected changes in accuracy, content, formatting, and more could break downstream workflows. The researchers recommend implementing ongoing tests tailored to your own applications.
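A minimal version of such an ongoing test is sketched below, assuming the openai Python SDK. The prompts, expected answers, pass-rate threshold, and the crude substring check are all placeholders to be replaced with cases drawn from your own application.

```python
# Sketch of a scheduled regression check against a hosted model.
# Assumes the openai Python SDK (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical prompt/expectation pairs; swap in prompts from your own workflow.
TEST_CASES = [
    ("Is 17077 a prime number? Answer [Yes] or [No].", "yes"),
    ("Is 21 a prime number? Answer [Yes] or [No].", "no"),
]

def run_regression(model: str = "gpt-4", min_pass_rate: float = 0.9) -> None:
    passed = 0
    for prompt, expected in TEST_CASES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content.lower()
        # Crude substring match, for illustration only; real checks should parse the answer.
        if expected in reply:
            passed += 1
    pass_rate = passed / len(TEST_CASES)
    print(f"{model}: {passed}/{len(TEST_CASES)} checks passed ({pass_rate:.0%})")
    assert pass_rate >= min_pass_rate, "Model behavior drifted below the acceptance threshold"

if __name__ == "__main__":
    run_regression()
```

Run on a schedule (for example, from CI or a cron job), a harness like this turns silent model updates into visible test failures.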

As LLMs continue rapidly evolving, further research is critical to track their longitudinal progress, identify potential pitfalls, and elucidate best practices for stable integration. With LLMs being deployed in sensitive domains like medicine and law, maintaining rigorous oversight is crucial.
