Scaling TransNormer to 175 Billion Parameters

The field of natural language processing has seen monumental advances with the rise of large language models (LLMs) like GPT-3. However, a key limitation of these models is their reliance on softmax self-attention, whose compute and memory costs grow quadratically with sequence length, hampering scalability and efficiency.

A new paper titled “Scaling TransNormer to 175 Billion Parameters” presents TransNormerLLM, the first linear attention-based LLM that outperforms conventional softmax attention models in both accuracy and efficiency. Developed by researchers from Shanghai AI Lab and OpenNLPLab, this model demonstrates superior performance across a range of sizes from 385 million to 175 billion parameters.

At the core of TransNormerLLM is the replacement of softmax attention with linear attention, combined with a linearized relative positional encoding (LRPE) that preserves global interactions between tokens. Because plain linear attention can suffer from attention dilution, the authors pair LRPE with an exponential decay (LRPE-d), which lets the model retain full attention at the lower layers.
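
To make this concrete, here is a minimal PyTorch sketch of causal linear attention with a per-head exponential decay applied to the running key-value state, in the spirit of the LRPE-d idea described above. The feature map, tensor shapes, and decay schedule are illustrative assumptions rather than the authors' exact implementation, and the output normalization a full block would apply is omitted.

```python
import torch

def decayed_linear_attention(q, k, v, decay):
    """Causal linear attention with per-head exponential decay (illustrative sketch).

    q, k, v: (batch, heads, seq_len, dim) tensors.
    decay:   (heads,) per-head decay rates in (0, 1); a stand-in for the
             position-aware decay used by LRPE-d, not the paper's exact scheme.
    """
    b, h, n, d = q.shape
    # Non-negative feature map so the kernelized attention stays well behaved.
    q, k = torch.nn.functional.elu(q) + 1, torch.nn.functional.elu(k) + 1
    state = torch.zeros(b, h, d, v.shape[-1], device=q.device, dtype=q.dtype)
    outputs = []
    for t in range(n):
        # Decay the running key-value state, then add the current token's contribution.
        state = decay.view(1, h, 1, 1) * state + k[:, :, t, :, None] * v[:, :, t, None, :]
        outputs.append(torch.einsum("bhd,bhde->bhe", q[:, :, t], state))
    return torch.stack(outputs, dim=2)  # (batch, heads, seq_len, dim_v)

# Toy usage.
q = torch.randn(2, 4, 16, 8)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)
decay = torch.linspace(0.9, 0.99, 4)  # slower decay keeps attention "fuller"
print(decayed_linear_attention(q, k, v, decay).shape)  # torch.Size([2, 4, 16, 8])
```

Because the decay acts on a running state rather than an explicit attention matrix, the cost grows linearly with sequence length instead of quadratically.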

Additionally, the model incorporates several optimizations, including Lightning Attention, an acceleration technique that more than doubles the runtime speed of linear attention while cutting its memory usage by a factor of four. Simplified gating and normalization modules boost overall model speed by a further 20%. Training is scaled efficiently using model parallelism, activation checkpointing, and automatic mixed precision.
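
As a rough illustration of what "simplified gating and normalization" can look like, the sketch below implements a gated linear unit with the activation removed and an RMS-style normalization with no learnable parameters. The module names, dimensions, and wiring are assumptions for illustration, not the paper's exact layers; in PyTorch, activation checkpointing and automatic mixed precision are typically provided by torch.utils.checkpoint and torch.cuda.amp.

```python
import torch
from torch import nn

class SimpleRMSNorm(nn.Module):
    """RMS normalization with no learnable gain (illustrative, lightweight norm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.scale = dim ** 0.5
        self.eps = eps

    def forward(self, x):
        # Normalize each token vector by its L2 norm, then rescale by sqrt(dim).
        return x / x.norm(dim=-1, keepdim=True).clamp(min=self.eps) * self.scale

class SimpleGLU(nn.Module):
    """Gated linear unit with the nonlinearity removed: (x W_u) * (x W_v) W_o."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.u = nn.Linear(dim, hidden, bias=False)
        self.v = nn.Linear(dim, hidden, bias=False)
        self.o = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.o(self.u(x) * self.v(x))

# Toy usage with assumed dimensions.
x = torch.randn(2, 16, 512)
block = nn.Sequential(SimpleRMSNorm(512), SimpleGLU(512, 1365))
print(block(x).shape)  # torch.Size([2, 16, 512])
```

Dropping the activation and the learnable norm parameters removes work from every layer, which is where speedups of this kind come from.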

Rigorous experiments were conducted on a self-collected 6TB corpus containing 2 trillion tokens. A self-cleaning strategy ensured that only high-quality, relevant data was used for pretraining. Across model sizes, TransNormerLLM consistently surpassed conventional Transformer baselines in accuracy, training speed, and maximum context length handled.
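
For intuition only, the following sketch shows the kind of rule-based filtering and exact deduplication a self-cleaning pipeline might start with; the helper names and thresholds are invented for illustration and are not the authors' actual data pipeline.

```python
import hashlib

def passes_rules(doc: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Illustrative rule-based filter: drop very short or symbol-heavy documents."""
    if len(doc) < min_chars:
        return False
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / len(doc) <= max_symbol_ratio

def clean_corpus(docs):
    """Apply rule filters, then remove exact duplicates via content hashing."""
    seen = set()
    for doc in docs:
        if not passes_rules(doc):
            continue
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

# Toy usage: one document is too short, one is a duplicate.
sample = ["short", "a" * 300, "a" * 300]
print(len(list(clean_corpus(sample))))  # 1
```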

With its ability to scale smoothly to 175 billion parameters while delivering faster training and inference, TransNormerLLM represents a pivotal breakthrough in efficient LLM architectures. The authors plan to open-source the models to promote further research into efficient transformers for LLMs.

If validated at scale, such linear attention-based models hold immense promise to push the boundaries of what’s possible in language AI. Possible use cases include more powerful conversational agents, superior text generation abilities and multitask learning across diverse NLP datasets. With efficiency gains in training, TransNormerLLM could also democratize access to capable LLMs for a broader range of organizations.

As LLMs grow larger in scale, addressing efficiency and scalability will be key to continued progress. TransNormerLLM sets an important precedent in this direction – showcasing how architectural innovations can unlock unprecedented scales efficiently. This research highlights the exciting potential of efficient transformer designs to keep propelling the frontiers of language AI forward.
