A new approach called SPRING allows large language models (LLMs) such as GPT-4 to achieve strong performance on the challenging Crafter benchmark without any training, simply by reading the academic paper that describes the environment.
In the paper “SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning”, researchers from Carnegie Mellon University, Microsoft, and other institutions propose a two-stage approach called SPRING.
First, the LLM reads the LaTeX source of the Crafter paper paragraph by paragraph. For each paragraph, it judges whether the content is relevant to gameplay; the relevant paragraphs are then queried with a series of questions to extract key information such as objectives, actions, and crafting dependencies. The answers are concatenated into a context string that summarizes useful gameplay knowledge.
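To make this first stage concrete, here is a minimal Python sketch of such a paragraph-filtering and question-answering loop. The `llm` wrapper, the relevance prompt, and the extraction questions are illustrative placeholders, not the authors' actual prompts.

```python
# Sketch of SPRING's stage 1 (paper reading), assuming a generic
# llm(prompt) -> str wrapper around a GPT-4 chat completion call.

def llm(prompt: str) -> str:
    """Placeholder for a call to GPT-4 (e.g. via an API client)."""
    raise NotImplementedError

RELEVANCE_PROMPT = (
    "Does the following paragraph from the Crafter paper contain information "
    "useful for playing the game (objectives, actions, crafting dependencies)? "
    "Answer yes or no.\n\n{paragraph}"
)

# Hypothetical extraction questions; the paper uses its own question set.
EXTRACTION_QUESTIONS = [
    "List the achievements or objectives the agent can accomplish.",
    "List the actions available to the agent.",
    "List any resource or crafting dependencies mentioned.",
]

def build_context(paper_paragraphs: list[str]) -> str:
    """Read the paper paragraph by paragraph and distill a gameplay context string."""
    notes = []
    for paragraph in paper_paragraphs:
        # Skip paragraphs the LLM judges irrelevant to gameplay.
        if "yes" not in llm(RELEVANCE_PROMPT.format(paragraph=paragraph)).lower():
            continue
        for question in EXTRACTION_QUESTIONS:
            notes.append(llm(f"{question}\n\nParagraph:\n{paragraph}"))
    # Concatenate the extracted knowledge into one context string.
    return "\n".join(notes)
```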
Second, during actual gameplay, the current and previous visual frames are converted into text descriptions by a separate module. This observation text, along with the context string, is fed into a reasoning module built around a directed acyclic graph (DAG) of questions. The questions prompt the LLM to reason step by step about the state, requirements, possible actions, and priorities. By traversing the DAG, the LLM arrives at the best action to take, and the text describing this action is then mapped into the environment’s action space.
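The reasoning stage can likewise be sketched as a topological traversal of a question graph, where each answer is fed into the prompts of its child questions. The graph, question wording, and action mapping below are simplified stand-ins for the paper's full question DAG, not its actual prompts.

```python
# Sketch of the stage-2 reasoning module, under the same assumptions as above.
from graphlib import TopologicalSorter

def llm(prompt: str) -> str:
    """Same GPT-4 placeholder as in the previous sketch."""
    raise NotImplementedError

# Each node is a question; it lists the questions whose answers it depends on.
QUESTION_DAG = {
    "state": [],
    "requirements": ["state"],
    "actions": ["state"],
    "best_action": ["requirements", "actions"],
}

QUESTIONS = {
    "state": "Summarize the agent's current situation.",
    "requirements": "Given the situation, what does the most important objective require?",
    "actions": "Given the situation, which actions are available right now?",
    "best_action": "Considering the requirements and available actions, name the single best action.",
}

# Illustrative excerpt of a discrete action mapping; ids are hypothetical.
ACTION_SPACE = {"move_left": 1, "move_right": 2, "do": 5, "place_table": 8}

def choose_action(context: str, observation_text: str) -> int:
    """Traverse the question DAG in topological order, feeding earlier answers
    into later prompts, then map the final answer to an environment action id."""
    answers = {}
    for node in TopologicalSorter(QUESTION_DAG).static_order():
        parent_answers = "\n".join(answers[p] for p in QUESTION_DAG[node])
        prompt = (
            f"Game manual:\n{context}\n\nObservation:\n{observation_text}\n\n"
            f"Previous reasoning:\n{parent_answers}\n\n{QUESTIONS[node]}"
        )
        answers[node] = llm(prompt)
    # Convert the textual action into the environment's action space,
    # falling back to a default interaction if no action name matches.
    final = answers["best_action"].strip().lower()
    return next((idx for name, idx in ACTION_SPACE.items() if name in final),
                ACTION_SPACE["do"])
```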
The researchers evaluated SPRING in the procedural survival game Crafter, which requires multi-tasking, exploration, and long-term planning. Without any training, SPRING with GPT-4 achieved a game score of 27.3 and a reward of 12.3, comparing favorably to top reinforcement learning (RL) algorithms such as DreamerV3 trained for 1 million steps.
However, popular AI YouTuber Edan Mayer argues in his video review that the paper makes an unfair comparison to RL methods by limiting them to only 1 million training steps, whereas GPT-4’s pre-training draws on massive datasets. He notes that the state-of-the-art DreamerV3 actually reaches a substantially higher reward of around 16 on Crafter when allowed to train for 10-20 million steps. Mayer suggests that comparing against the best RL methods without restricting their experience would be more representative.
While SPRING demonstrates that large language models can understand papers and reason over them when prompted appropriately, the comparison to RL algorithms constrained to 1 million steps may overstate its advantage. Still, the zero-shot transfer of knowledge from a paper to gameplay has exciting implications. The authors suggest SPRING could enable the integration of human knowledge into RL training.
More broadly, the ability to reason reliably over technical documents opens up possibilities for LLMs to rapidly acquire expertise or assist human experts across fields. However, the approach still relies on a manually designed visual observation module, and future work should investigate how to make such systems more generalizable.