A new paper from Anthropic researchers explores whether the reasoning that large language models (LLMs) provide alongside their answers is faithful to their internal reasoning process, i.e., whether the step-by-step “chain of thought” (CoT) explanations actually reflect how the model arrives at its answer.
The researchers evaluated LLMs trained to generate CoT reasoning before producing a final answer. To test faithfulness, they proposed interventions such as truncating the CoT midway or inserting deliberate mistakes into it, then checking whether the model’s final answer changes in response (see the sketch below). The idea is that if the model genuinely relies on its stated reasoning, intervening on the CoT should change the conclusion.
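To make the truncation intervention concrete, here is a minimal Python sketch, not the authors’ actual code, under two assumptions: `generate` is a hypothetical callable that sends a prompt to a model and returns its completion, and the model’s original CoT has already been split into steps. It re-asks the question with the reasoning cut off at each step and reports how often the final answer flips.

```python
from typing import Callable, List


def truncated_cot_answers(
    generate: Callable[[str], str],  # hypothetical model-call function: prompt -> completion
    question: str,
    cot_steps: List[str],            # the model's original chain of thought, split into steps
) -> List[str]:
    """Re-ask the question with the CoT truncated after 0, 1, ... steps and collect the answers."""
    answers = []
    for k in range(len(cot_steps)):
        partial_cot = "\n".join(cot_steps[:k])
        prompt = (
            f"Question: {question}\n"
            f"Reasoning so far:\n{partial_cot}\n"
            f"Final answer:"
        )
        answers.append(generate(prompt).strip())
    return answers


def answer_change_rate(original_answer: str, truncated_answers: List[str]) -> float:
    """Fraction of truncations that flip the final answer; higher suggests more reliance on the CoT."""
    if not truncated_answers:
        return 0.0
    changed = sum(a != original_answer for a in truncated_answers)
    return changed / len(truncated_answers)
```

The mistake-insertion variant is analogous: replace one step of the CoT with a corrupted version, let the model continue from there, and check whether its final answer follows the error.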
Experiments across different models and question-answering tasks revealed a wide variation in faithfulness. On some tasks, LLMs appeared to largely ignore the CoT when making their final prediction, suggesting post-hoc, unfaithful reasoning. But on other tasks like algebra word problems, modifying the CoT impacted the models’ answers, indicating stronger faithfulness.
Smaller models were found to produce more faithful CoT than larger models on the same tasks. This “inverse scaling” suggests that faithfulness decreases as models become more capable at a task. The researchers posit that smaller models may rely more on the explicit reasoning in the CoT, while larger models can reach their conclusions without it.
While this paper provides initial insights into reasoning faithfulness, further research is needed to fully understand these phenomena in LLMs. With more work, we may be able to develop LLMs that provide trustworthy explanations alongside their predictions, enabling safer deployment of AI systems. The proposed interventions also give us tools to audit when LLMs are generating misleading explanations.