New Benchmark Tests AI Agents in Real-World Challenges

Researchers from Tsinghua University, The Ohio State University, and UC Berkeley have introduced AgentBench, a new benchmark for evaluating large language models (LLMs) on their ability to act as intelligent agents in interactive environments.

The benchmark comprises 8 distinct environments that assess LLMs across a wide spectrum of real-world use cases: operating systems, databases, knowledge graphs, games, puzzles, household tasks, web shopping, and web browsing. It is the first systematic attempt to evaluate LLMs as autonomous agents across such a diverse set of practical scenarios.

AgentBench evaluates LLM-as-Agent on a wide array of real-world challenges across 8 distinct environments.

AgentBench goes beyond static datasets and closed environments by incorporating real-time interactions and open-ended action spaces. The challenges require models to demonstrate complex reasoning, planning, knowledge acquisition, and decision-making skills. For instance, in the knowledge graph domain, the LLM must query a large Freebase graph to answer natural language questions.
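To make the interaction pattern concrete, the sketch below shows a generic observation-action loop of the kind such interactive benchmarks rely on. It is a minimal illustration in Python: the environment interface (`instructions`, `reset`, `step`) and the `query_llm` callable are hypothetical stand-ins, not AgentBench's actual API.

```python
# Hypothetical sketch of an interactive agent-evaluation loop.
# The environment interface (instructions/reset/step) and query_llm are
# illustrative assumptions, not AgentBench's real API.

def run_episode(env, query_llm, max_turns=20):
    """Drive one episode: feed observations to the LLM and execute its actions."""
    history = [{"role": "system", "content": env.instructions()}]
    observation = env.reset()
    for _ in range(max_turns):
        history.append({"role": "user", "content": observation})
        action = query_llm(history)                    # model proposes the next action
        history.append({"role": "assistant", "content": action})
        observation, done, reward = env.step(action)   # environment executes the action
        if done:
            return reward                              # task-level score (e.g. success = 1.0)
    return 0.0                                         # episode ended without completing the task
```

In the knowledge graph setting, for example, the action would be a structured query against the Freebase graph and the observation the returned query result.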

The researchers tested 25 LLMs on AgentBench, including top API-based commercial models such as GPT-4 and Claude, as well as open-source models. The results revealed a significant performance gap, with GPT-4 achieving the highest scores across most environments; it successfully completed 78% of household tasks, demonstrating viability for real-world applications.

Typical LLMs’ AgentBench performance (relative) against the best in each environment.
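As a rough illustration of the "relative" scores in the figure above, one simple normalization divides each model's raw score by the best score in that environment. The snippet below is only a sketch of that idea, not the paper's exact scoring procedure.

```python
# Illustrative normalization: express each model's score relative to the best
# model in one environment. This is an assumption about the general idea,
# not AgentBench's published scoring formula.

def relative_scores(raw_scores: dict[str, float]) -> dict[str, float]:
    """Map {model: raw score} to {model: score / best score} for one environment."""
    best = max(raw_scores.values())
    if best == 0:
        return {model: 0.0 for model in raw_scores}
    return {model: score / best for model, score in raw_scores.items()}

# Example: {"gpt-4": 0.78, "open-model": 0.39} -> {"gpt-4": 1.0, "open-model": 0.5}
```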

While state-of-the-art commercial LLMs showed strong agent abilities, open-source models generally struggled on the more complex domains. This highlights the need for advances in training methodologies and model scaling. The benchmark creators plan to continue evolving AgentBench to drive progress in this emerging field.

AgentBench provides a rigorous assessment of where LLMs currently stand as capable agents. Its interactive nature and real-world focus differentiate it from existing static benchmarks. As LLM research continues to advance rapidly, benchmarks like this will be critical for systematically tracking progress on the path to artificial general intelligence. The diverse challenges encompassed by AgentBench establish a solid foundation for future work on developing LLMs that can act effectively across many situations.

Datasets, environments, and an integrated evaluation package for AgentBench are released at: https://github.com/THUDM/AgentBench
