A new benchmark dataset called the Advanced Reasoning Benchmark (ARB) aims to push the reasoning abilities of artificial intelligence systems to their limits across multiple domains. Created by researchers at DuckAI, Georgia Tech, ETH Zürich, Nomos AI, and other institutions, ARB focuses on problems in mathematics, physics, biology, chemistry, and law that require expert-level knowledge and reasoning.
The researchers argue that existing benchmarks for testing AI reasoning are becoming outdated, as large language models like GPT-3 and GPT-4 now achieve very high scores on them. ARB offers a more difficult test, with graduate-level problems sourced from professional exams and resources.
Early results show current AI systems struggling on ARB’s quantitative problems, scoring below 50% on math and physics with standard prompting techniques. The researchers evaluated GPT-4, Claude, and other models. Multiple-choice questions proved easier for the models than short-answer and open-response questions.
To improve automated evaluation, the researchers proposed a rubric-based self-evaluation method in which GPT-4 scores its own intermediate reasoning steps against a rubric. While not yet reliable enough to replace human evaluation, rubric-based scoring showed promise in both its coverage of the reasoning steps and its correlation with human scores.
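To make the idea concrete, here is a minimal Python sketch of rubric-based grading under stated assumptions: the rubric items, the grading prompt, and the query_model placeholder are all illustrative inventions, not the ARB authors' actual rubrics or prompts. The sketch assigns points to individual solution steps, asks a grader model to return awarded points as JSON, and aggregates them into a fractional score.

```python
import json
from typing import Dict

# Hypothetical rubric for a single ARB-style physics problem.
# Each criterion carries a point value; real rubrics would be
# problem-specific and written by the dataset authors.
RUBRIC: Dict[str, int] = {
    "Sets up conservation of energy correctly": 2,
    "Substitutes the given numerical values": 1,
    "Arrives at the correct final answer with units": 2,
}

def build_grading_prompt(problem: str, model_solution: str) -> str:
    """Assemble a prompt asking the grader model to score a solution
    against the rubric and reply with a JSON object of awarded points."""
    criteria = "\n".join(
        f"- {name} (max {points} points)" for name, points in RUBRIC.items()
    )
    return (
        "You are grading a step-by-step solution against a rubric.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Candidate solution:\n{model_solution}\n\n"
        f"Rubric:\n{criteria}\n\n"
        "Return a JSON object mapping each rubric item to the points awarded."
    )

def query_model(prompt: str) -> str:
    """Placeholder for a call to a chat-model API (e.g. GPT-4).
    Here it returns a canned full-marks response so the script runs end to end."""
    return json.dumps({name: points for name, points in RUBRIC.items()})

def rubric_score(problem: str, model_solution: str) -> float:
    """Ask the grader model for per-criterion points and return the
    fraction of total rubric points awarded."""
    raw = query_model(build_grading_prompt(problem, model_solution))
    awarded = json.loads(raw)
    earned = sum(
        min(awarded.get(name, 0), maximum) for name, maximum in RUBRIC.items()
    )
    total = sum(RUBRIC.values())
    return earned / total

if __name__ == "__main__":
    score = rubric_score(
        problem="A 2 kg mass falls 5 m from rest. Find its speed at the bottom.",
        model_solution="By energy conservation, v = sqrt(2gh) = sqrt(2*9.8*5) ≈ 9.9 m/s.",
    )
    print(f"Rubric score: {score:.2f}")
```

In practice, query_model would call a chat-model API, and the same model that produced the solution can act as the grader, which is what makes the approach a self-evaluation.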
ARB provides a valuable new benchmark for pushing AI reasoning abilities, but the low initial scores highlight the limitations of today’s systems. Advances in reasoning are essential if AI is to move beyond automating narrow skills and genuinely augment professionals. For now, ARB shows that general intelligence remains beyond the horizon. But benchmarks like this give researchers concrete targets, and each small improvement represents progress.