A new study from researchers at UNC Chapel Hill and Google Research reveals that large language models (LLMs) like GPT-3 and BLOOM struggle to learn facts that rarely appear in their training data.
The researchers evaluated LLMs on open-domain question answering benchmarks like TriviaQA and Natural Questions. They found a strong correlation between a model’s ability to answer a question correctly and how many times information relevant to that question appeared in the pre-training data: models were highly inaccurate on questions about rare facts that appeared only a handful of times.
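To quantify this relationship, one can bucket questions by how many relevant pre-training documents they have and compute accuracy per bucket. Below is a minimal sketch of that kind of analysis, assuming per-question `(num_relevant_docs, correct)` pairs are already available; the function name and log-spaced bucketing scheme are illustrative, not the paper's exact code:

```python
import math
from collections import defaultdict

def accuracy_by_count_bucket(results):
    """Group (num_relevant_docs, correct) pairs into log-spaced buckets
    and compute per-bucket accuracy.

    `results` is an iterable of (int, 0-or-1) pairs. Returns a dict
    mapping the bucket's lower bound (1, 10, 100, ...) to mean accuracy.
    """
    buckets = defaultdict(list)
    for n_docs, correct in results:
        # Bucket by order of magnitude; treat 0 docs as 1 to avoid log(0).
        bucket = int(math.log10(max(n_docs, 1)))
        buckets[bucket].append(correct)
    return {10**b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```

A rising curve from this table — higher accuracy in higher-count buckets — is the pattern the study reports across models and datasets.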
To identify relevant training data for each question, the researchers used an entity linking pipeline. This allowed them to efficiently search through massive pre-training datasets and find documents where the salient entities in the question and answer co-occurred.
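In rough terms, a document counts as relevant to a question when the document's linked entities include both an entity from the question and an entity from the answer. A toy sketch of that counting step, assuming entity linking has already been run over each document (`count_relevant_docs` is a hypothetical name, and the real pipeline operates at far larger scale):

```python
def count_relevant_docs(question_entities, answer_entities, linked_docs):
    """Count documents whose linked entities contain at least one
    question entity AND at least one answer entity — a simplified
    version of the co-occurrence criterion described above.

    `linked_docs` is a list of per-document entity lists produced by
    an entity linker.
    """
    q_ents = set(question_entities)
    a_ents = set(answer_entities)
    count = 0
    for doc_entities in linked_docs:
        ents = set(doc_entities)
        if ents & q_ents and ents & a_ents:
            count += 1
    return count
```

In practice this lookup is done with precomputed entity indexes rather than a linear scan, which is what makes searching multi-terabyte pre-training corpora feasible.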
The study compared LLMs ranging from 125 million to 176 billion parameters, trained on datasets like the Pile and C4. Bigger models consistently performed better, but even the largest ones struggled on rare facts. For example, the accuracy of BLOOM-176B jumped from 25% to 55% as the number of relevant pre-training documents increased from 10 to 10,000.
The researchers also showed the relationship is causal: removing relevant documents from the training data and retraining a model significantly hurt its accuracy on the corresponding questions. Humans, in contrast, were most accurate on the rarest questions, which tend to ask about simple facts.
The study suggests that simply scaling up models and datasets will not be sufficient for capturing this long-tail knowledge; models would likely need to grow by orders of magnitude to close the gap. However, adding a retrieval module that fetches relevant documents at inference time can reduce models’ dependence on how often a fact appears in pre-training.
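As a rough illustration of the retrieval idea, the sketch below ranks documents by simple word overlap with the question (a toy stand-in for BM25 or dense retrieval, not the study's actual system) and prepends the top hits to the prompt, so the model can read the fact rather than recall it from pre-training:

```python
def retrieve_top_k(question, corpus, k=2):
    """Rank documents by word overlap with the question and return
    the top k. A toy scoring function for illustration only."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, corpus, k=2):
    """Prepend retrieved context so the model need not rely solely on
    knowledge memorized during pre-training."""
    context = "\n".join(retrieve_top_k(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The design point is that retrieval shifts the burden from parametric memory to the context window, which is why it helps most on exactly the rare facts that models fail to memorize.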
Overall, this research offers new insight into what LLMs do and do not learn from their training data. Despite their impressive capabilities, these models remain limited by what they encounter during pre-training. Going forward, developing methods that learn from limited data will be key to reaching human-level performance. The findings also raise concerns about biases driven by how prevalent information is on the web.