GPT-4: The Open LLM Leaderboard, which compares open-access large language models, has sparked a discussion about discrepancies in the evaluation numbers reported for the LLaMA model. This article investigates how different MMLU evaluation implementations, including the Eleuther AI LM Evaluation Harness, the original UC Berkeley implementation, and Stanford CRFM's HELM benchmark, arrive at their scores. The findings show that the implementations yield different results and model rankings, underscoring the importance of open, standardized, and reproducible benchmarks for comparing models and advancing research in the field.
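To make the kind of discrepancy concrete, here is a minimal, self-contained Python sketch. The log-probabilities are made up for illustration and the scoring rules are generic stand-ins, not the exact code of any of the harnesses named above; the point is only that scoring just the answer letter, summing log-probabilities over the full answer text, and length-normalizing that sum can each pick a different answer from the same model outputs, which is enough to move MMLU accuracy and rankings.

```python
# Illustrative only: made-up log-probabilities, generic scoring rules.
# Three common ways to score a 4-way multiple-choice question from the
# same (hypothetical) model outputs.

# Convention 1: score only the log-probability of the answer letter token.
letter_logprobs = {"A": -1.2, "B": -1.5, "C": -1.4, "D": -2.0}

# Conventions 2 and 3: use per-token log-probabilities of the full answer
# text; longer answers accumulate more negative terms, so an unnormalized
# sum tends to favor short answers.
full_text_logprobs = {
    "A": [-1.2, -0.9, -1.1, -0.8],  # 4-token answer
    "B": [-1.5, -0.3],              # 2-token answer
    "C": [-1.4, -0.6, -0.5],        # 3-token answer
    "D": [-2.0, -1.0],              # 2-token answer
}

def pick_by_letter(scores):
    # Highest log-probability on the answer letter alone.
    return max(scores, key=scores.get)

def pick_by_sum(scores):
    # Highest summed log-probability over the full answer text.
    return max(scores, key=lambda k: sum(scores[k]))

def pick_by_mean(scores):
    # Highest length-normalized (per-token average) log-probability.
    return max(scores, key=lambda k: sum(scores[k]) / len(scores[k]))

print("letter only       :", pick_by_letter(letter_logprobs))    # -> A
print("full-text sum     :", pick_by_sum(full_text_logprobs))    # -> B
print("length-normalized :", pick_by_mean(full_text_logprobs))   # -> C
```

Running the sketch prints three different choices (A, B, and C), which is why seemingly minor implementation details in an evaluation harness can change a model's reported score.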