As artificial intelligence systems continue to advance, researchers face a new challenge: how can we evaluate and debug AI decision-making that surpasses human expertise? A new paper from ETH Zurich proposes an intriguing answer: test for logical consistency.
The paper notes that while we can’t directly judge the correctness of superhuman AI judgments, we can check that they satisfy certain logical, human-interpretable rules. For example, a chess engine shouldn’t assign wildly different valuations to two equivalent board positions, such as the same position reached through different move orders. The researchers refer to these as “consistency checks.”
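To make this concrete, here is a minimal sketch of such a check in Python. The `evaluate` function is a hypothetical stand-in for any engine or model that scores a position from a FEN string, and the tolerance is an illustrative choice rather than a value from the paper.

```python
# Minimal sketch of a position-equivalence consistency check, assuming a
# hypothetical `evaluate(fen)` that returns a model's valuation (in pawns)
# for a position given as a FEN string. Names and the threshold are
# illustrative, not from the paper.

def check_equivalent_positions(evaluate, fen_a, fen_b, tolerance=0.5):
    """Return True if two equivalent positions receive valuations within
    `tolerance` pawns of each other; False signals a consistency violation."""
    return abs(evaluate(fen_a) - evaluate(fen_b)) <= tolerance
```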
The team demonstrates their framework on three applications where ground truth is difficult to assess:
- Chess engines: Despite superhuman play, the leading engine Leela Chess Zero still made errors such as valuing a position significantly worse after playing its own recommended move.
- Forecasting: Large language models like GPT-3.5 often produced probabilities that violated basic rules of probability, such as the probabilities of an event and its negation summing to 1 (see the sketch after this list).
- Legal judgments: When prompted for bail decisions, GPT-3.5 would sometimes approve bail for the defendant with the more extensive criminal record of two otherwise comparable defendants, defying common sense.
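To see what the forecasting check looks like in code, here is a hedged Python sketch. The `ask_probability` helper is hypothetical (it would query a model and parse its reply into a number in [0, 1]), and the example questions and tolerance are invented for illustration.

```python
# Illustrative sketch of the negation consistency check for an LLM
# forecaster. `ask_probability` is a hypothetical helper that queries
# the model and parses its answer into a probability in [0, 1].

def negation_check(ask_probability, event, negated_event, tolerance=0.05):
    """P(A) + P(not A) should equal 1; flag larger deviations as violations."""
    total = ask_probability(event) + ask_probability(negated_event)
    return abs(total - 1.0) <= tolerance

# Example usage with a hypothetical, cleanly negated question pair:
# consistent = negation_check(
#     ask_probability,
#     "Will it rain in Zurich tomorrow?",
#     "Will it not rain in Zurich tomorrow?",
# )
```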
These consistency violations reveal genuine flaws in the models’ reasoning, even if we can’t pinpoint precisely which decisions are wrong. While such violations are rare on ordinary inputs, the researchers showed that adversarial search uncovers them far more readily.
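A toy version of such a search might look like the following hill-climbing loop. The `inconsistency` and `perturb` functions are hypothetical stand-ins, and the paper’s actual search procedures may differ.

```python
# Toy hill-climbing sketch of adversarial search for inconsistencies.
# `inconsistency` scores how badly an input violates a check (e.g.,
# |P(A) + P(not A) - 1| for forecasting), and `perturb` produces a
# semantically equivalent variant of the input. Both are hypothetical
# stand-ins, not functions from the paper.

def adversarial_search(seed_input, inconsistency, perturb, steps=200):
    """Greedily keep perturbations that increase the measured inconsistency."""
    best, best_score = seed_input, inconsistency(seed_input)
    for _ in range(steps):
        candidate = perturb(best)
        score = inconsistency(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```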
This technique of probing for logical failures could be key to ensuring reliability as AI takes on more impactful real-world planning and decision-making. Consistency checks can never guarantee safety, but they can surface bugs that end-to-end testing would miss, and they could complement efforts to build inherently safe AI.
Looking forward, key challenges are scaling consistency testing and making it robust to attempts at deception. More broadly, the paper illustrates the importance of developing AI evaluation methods that look beyond just accuracy metrics. When ground truth is unclear, consistency provides a promising path to debug superhuman intelligence.