OpenAI has released a new lightweight library for evaluating language models, debuting alongside the introduction of `gpt-4-turbo-2024-04-09`. The release is part of OpenAI’s commitment to transparency about the accuracy of its models. The library focuses on a zero-shot, chain-of-thought evaluation setting, which OpenAI believes more accurately reflects how the models perform in real-world use. Unlike other evaluation repositories, this one will not be actively maintained with new evaluations; it will, however, accept bug fixes, adapters for new models, and updated evaluation results for new models and system prompts.
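To make the zero-shot, chain-of-thought setting concrete, here is a minimal sketch of what such an evaluation loop might look like. The prompt template, answer-extraction regex, and `grade_one` helper are illustrative assumptions, not the library’s actual code; the only real API used is the OpenAI chat completions endpoint.

```python
import re
from openai import OpenAI

# Hypothetical zero-shot, chain-of-thought template: no few-shot examples,
# just an instruction to reason step by step and end with a fixed answer format.
QUERY_TEMPLATE = (
    "Solve the following problem step by step. "
    "End your response with a line of the form 'Answer: <final answer>'.\n\n{question}"
)

ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)


def grade_one(client: OpenAI, model: str, question: str, target: str) -> bool:
    """Score a single item: True if the extracted final answer matches the target."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUERY_TEMPLATE.format(question=question)}],
        temperature=0.0,
    )
    text = response.choices[0].message.content or ""
    match = ANSWER_PATTERN.search(text)
    return bool(match) and match.group(1).strip() == target.strip()


if __name__ == "__main__":
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    print(grade_one(client, "gpt-4-turbo-2024-04-09", "What is 12 * 7?", "84"))
```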
The repository includes evaluations for various benchmarks such as MMLU, MATH, GPQA, DROP, MGSM, HumanEval, and MMMU, covering a wide range of language understanding and problem-solving capabilities. Sampling interfaces for OpenAI and Claude APIs are provided, with setup instructions for each evaluation and sampler detailed within the repository. The benchmark results showcase the performance of different models, including various versions of GPT-4 and Claude, across these evaluations.
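As an illustration of how a single sampling interface can sit in front of both the OpenAI and Claude APIs, the sketch below defines two interchangeable samplers that an evaluation can call without knowing which provider is behind them. The class names, `__call__` shape, and model strings are assumptions for illustration, not the repository’s actual interfaces.

```python
from dataclasses import dataclass

# Illustrative "sampler" abstraction: messages in, completion text out.
# An evaluation written against this shape can swap providers in one line.


@dataclass
class OpenAISampler:
    model: str

    def __call__(self, messages: list[dict]) -> str:
        from openai import OpenAI
        response = OpenAI().chat.completions.create(model=self.model, messages=messages)
        return response.choices[0].message.content or ""


@dataclass
class ClaudeSampler:
    model: str
    max_tokens: int = 1024

    def __call__(self, messages: list[dict]) -> str:
        import anthropic
        response = anthropic.Anthropic().messages.create(
            model=self.model, max_tokens=self.max_tokens, messages=messages
        )
        return response.content[0].text


# Example usage (model names shown for illustration):
# sampler = OpenAISampler("gpt-4-turbo-2024-04-09")
# sampler = ClaudeSampler("claude-3-opus-20240229")
# completion = sampler([{"role": "user", "content": "What is the capital of France?"}])
```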
This library is not intended to replace the comprehensive collection of evaluations in OpenAI’s main evals repository; it serves instead as a focused, transparent snapshot of model performance. Contributors to this repository must agree to license their evaluations under the MIT license and ensure they have the rights to any data used.
Read more at GitHub…