Llama 2: An Open Large Language Model Matching Proprietary Chatbots

A new large language model called Llama 2 was recently open-sourced by researchers at Meta AI. With up to 70 billion parameters, Llama 2 demonstrates capabilities on par with leading proprietary chatbots like ChatGPT, according to evaluations on over 4,000 test prompts.

Pretraining

The researchers pretrained Llama 2 on a mix of publicly available data totaling 2 trillion tokens. They increased the context length to 4,096 tokens, double that of the original Llama, allowing the model to develop stronger long-range reasoning abilities. Llama 2 also uses grouped-query attention (GQA), which improves inference scalability for the larger 34 billion and 70 billion parameter variants by sharing key/value heads across groups of query heads. On academic benchmarks, the Llama 2 pretrained models outperform other open-source baselines like GPT-J across areas such as commonsense reasoning and reading comprehension.
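The core idea of grouped-query attention is that many query heads share a smaller number of key/value heads, shrinking the KV cache that dominates inference memory at long context lengths. The following is a minimal NumPy sketch of that sharing pattern (a toy illustration, not the paper's implementation; all names and shapes here are assumptions for the example):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Toy grouped-query attention: several query heads share one K/V head.

    q:    (n_q_heads, seq, d)  per-head queries
    k, v: (n_kv_heads, seq, d) shared key/value heads, n_kv_heads < n_q_heads
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads          # query heads per shared K/V head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # which shared K/V head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d) # (seq, seq) attention logits
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)   # softmax over keys
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 shared K/V heads to cache
v = rng.standard_normal((2, 4, 16))
print(grouped_query_attention(q, k, v).shape)  # (8, 4, 16)
```

With 8 query heads but only 2 K/V heads, the KV cache is a quarter of the size it would be under standard multi-head attention, while the output shape is unchanged.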

Fine-tuning

The researchers fine-tuned Llama 2 using a combination of supervised learning and reinforcement learning from human feedback (RLHF) to optimize the model for helpful and safe conversational responses. This aligns the model with human preferences and improves performance on dialogue tasks compared to existing open-source models like GPT-J.

To create the conversational Llama 2-Chat models, the researchers first applied supervised fine-tuning (SFT) using high-quality human demonstrations of helpfulness and safety. This provided a strong initialization before the main reinforcement learning stage. For reinforcement learning, they iteratively collected human preference comparisons to train separate helpfulness and safety “reward models”. These reward models then guided the fine-tuning to align generations with human preferences. After 5 rounds of this iterative process, Llama 2-Chat matches proprietary chatbots in human evaluations assessing helpfulness and safety.
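The reward models described above are trained from pairwise human comparisons: given two responses to the same prompt, the model should score the preferred one higher. A common way to express this, and the form of loss the Llama 2 paper uses (with an optional margin reflecting how strongly raters preferred one response), is a binary ranking loss. A minimal sketch:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected, margin=0.0):
    """Pairwise ranking loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected - margin)).

    r_chosen / r_rejected are scalar reward scores the model assigns to the
    preferred and dispreferred responses; the loss pushes their gap above
    the margin. Illustrative sketch, not the paper's training code.
    """
    gap = r_chosen - r_rejected - margin
    return -np.log(1.0 / (1.0 + np.exp(-gap)))

# When the model cannot tell the responses apart, the loss is log(2):
print(round(preference_loss(0.0, 0.0), 4))  # 0.6931
# A wider score gap in the right direction drives the loss toward zero:
print(round(preference_loss(3.0, -1.0), 4))
```

Minimizing this loss over many human comparisons teaches the reward model to reproduce annotator preferences, which the reinforcement learning stage then optimizes against.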

Interesting Observations

Llama 2 training revealed some interesting capabilities like temporally organizing facts and spontaneously using tools through conversational interfaces. The researchers also found that model performance continues improving with more reward modeling data, suggesting current annotation volumes are not yet saturating the models. They emphasize the importance of transparency and collaboration to build safer, more capable models.

Capabilities and comparison

Reasoning & QA: On commonsense reasoning benchmarks like HellaSwag and WinoGrande, Llama 2 outperforms open-source models like GPT-J as well as Anthropic's Claude. It is on par with GPT-3.5 on reading comprehension datasets like SQuAD and Natural Questions. However, a gap remains compared to models like PaLM and GPT-4.

Coding: Llama 2 lags on coding benchmarks like HumanEval, scoring lower than GPT-3.5 and PaLM on program synthesis.

Knowledge: Llama 2 demonstrates strong world knowledge, outperforming Claude on the TriviaQA factoid QA dataset. Its knowledge still falls short of much larger proprietary models like PaLM 2.

Dialogue: After fine-tuning for dialogue, Llama 2-Chat matches proprietary chatbots like ChatGPT on human evaluations of helpfulness. It also generates safer, more honest responses than GPT-3.5 on safety benchmarks.

Summary

The open release of Llama 2-Chat could have significant implications for the field of conversational AI by providing an advanced dialogue agent that rivals proprietary models, but with transparency into its training methodology. This openness allows the community to build on the work and contribute to responsible development of large language models.

With further refinement, Llama 2-Chat may find widespread use in applications like customer service chatbots, personal assistants, tutoring systems, and more. Its capabilities in knowledge-intensive domains could also enable new smart search and recommendation features. Along with the ongoing work to ensure safety, Llama 2 helps democratize access to powerful conversational AI.
