A new study reveals that simply increasing the number of agents in an ensemble can boost the performance of large language models (LLMs) across a variety of tasks.
Researchers from Tencent Inc. conducted comprehensive experiments evaluating different ensemble sizes on benchmarks spanning reasoning, generation, and other capabilities. They found that their simple “sampling-and-voting” technique for instantiating multiple agents consistently improved results across a range of LLMs, including Meta’s 13B- and 70B-parameter Llama2-Chat models and OpenAI’s GPT-3.5 Turbo.
Remarkably, smaller models like Llama2-13B could match or even surpass the performance of much larger models like GPT-3.5 Turbo by scaling up the number of agents to 15-20. The gains were especially pronounced on difficult reasoning tasks, with accuracy improvements of 6-24% on math problems and 1-11% on general reasoning.
For open-ended generation tasks such as code generation, the researchers used BLEU scores to quantify the similarity between each agent’s output and the outputs of all the other agents. The agent whose text earns the highest average BLEU score against the rest of the ensemble is deemed the “winner” for that round of voting, and its text is treated as the consensus response most representative of what the ensemble produced. So in tasks where agents generate free-form text, voting selects the response with the greatest n-gram overlap with the full set of responses, rather than simply picking the most frequent word-for-word output.
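As an illustration, here is a minimal sketch of that similarity-based vote in Python. It assumes NLTK’s BLEU implementation and simple whitespace tokenization; `agent_outputs`, a hypothetical list holding one generated string per agent, is not from the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def select_consensus(agent_outputs):
    """Return the generation with the highest average BLEU score
    against every other agent's generation (the ensemble 'winner')."""
    if len(agent_outputs) < 2:
        return agent_outputs[0]
    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    tokenized = [text.split() for text in agent_outputs]
    best_text, best_score = agent_outputs[0], float("-inf")
    for i, candidate in enumerate(tokenized):
        # Average pairwise BLEU of this candidate against all other outputs.
        others = [ref for j, ref in enumerate(tokenized) if j != i]
        score = sum(
            sentence_bleu([ref], candidate, smoothing_function=smooth)
            for ref in others
        ) / len(others)
        if score > best_score:
            best_text, best_score = agent_outputs[i], score
    return best_text
```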
The researchers posit that larger agent ensembles help overcome errors and inconsistencies when models attempt complex multi-step reasoning. Each agent produces a potentially different response, and majority voting selects the answer that appears most often, filtering out occasional mistakes by individual agents.
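For tasks with a single extractable answer, the sampling-and-voting loop reduces to a few lines. The sketch below assumes a hypothetical `ask_llm` callable that queries the model once and returns its final answer as a string; the default ensemble size of 15 merely echoes the range reported above.

```python
from collections import Counter

def sample_and_vote(ask_llm, prompt, n_agents=15):
    """Query the same model n_agents times and return the most frequent answer."""
    answers = [ask_llm(prompt) for _ in range(n_agents)]
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner
```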
They also found the ensemble method to be compatible with other techniques, such as chain-of-thought prompting and agent-debate frameworks, further boosting results. However, standalone ensembles proved highly competitive, achieving top accuracy with no additional prompt engineering or debate orchestration.
Analyzing task difficulty, the study found that the gains correlate with inherent problem complexity, the number of reasoning steps, and the prior probability of the correct answer. These insights informed sampling optimizations tailored to each task’s properties.
The simple yet effective ensemble method could make deploying LLMs more affordable. Ensembling many calls to a smaller model may provide a lower-cost alternative to a single, far larger model with hundreds of billions of parameters.
If the approach scales up in practice, it could enable wider access to powerful AI assistants, reasoning tools, and generators. The researchers aim to reduce the computational expenses of ensembling in future work. But for now, their results suggest that when it comes to improving LLMs, more agents is all you need.