Researchers from Tsinghua University, Ohio State University, and UC Berkeley have developed AgentBench, a benchmark for evaluating large language models (LLMs) as real-world agents. It tests models' ability to complete complex tasks across a range of environments, such as operating an SQL database and shopping online. The study found that top-tier models like GPT-4 significantly outperformed open-source models, suggesting such models could form the basis of a capable, continuously learning agent.
Read more at Cointelegraph…