A new paper from researchers at Microsoft proposes a neural network architecture called the Retentive Network (RetNet) that could supersede the Transformer as the go-to model for large language models.
Transformers have become ubiquitous in natural language processing, underpinning systems like GPT-3 and ChatGPT. However, they have real limitations: per-token inference cost and memory grow with context length, since the key-value cache must be kept and re-read for every generated token, and attention cost grows quadratically with sequence length during training. That makes scaling up expensive.
RetNets aim to resolve these issues while matching or exceeding Transformer performance. The key innovation is the retention mechanism, which can compute the same sequence transformation in three equivalent ways:
- Parallel retention – This works much like standard Transformer self-attention: the input sequence is projected into queries, keys and values, and pairwise scores are computed in parallel across the whole sequence, with a causal exponential-decay mask taking the place of softmax. This enables fast, parallelized training.
- Recurrent retention – The same computation can be reformulated as a recurrent network with a fixed-size state that accumulates information step by step. That state makes decoding O(1) per token, cutting inference cost.
- Chunkwise recurrent retention – For long sequences, the input can be split into chunks that are processed in parallel within each chunk, with a recurrent state carried across chunks. This balances computation and memory for efficient long-sequence modeling. A minimal sketch of all three forms follows below.
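To make the three modes concrete, here is a minimal single-head sketch in NumPy. It is illustrative only: it omits pieces of the actual architecture (the xPos-style position rotation, per-head decay values, group normalization and gating), and the dimensions, decay rate and chunk size are arbitrary placeholders. What it does show is the core property the paper relies on: all three formulations produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 12, 16
gamma = 0.9                                    # scalar decay (one per head in the paper)

X  = rng.standard_normal((seq_len, d))         # toy input sequence
Wq = rng.standard_normal((d, d)) / np.sqrt(d)  # content-aware projections
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)
Q, K, V = X @ Wq, X @ Wk, X @ Wv

def parallel_retention(Q, K, V, gamma):
    """Whole-sequence form used for training: one big masked matmul."""
    n = np.arange(Q.shape[0])
    # D[i, j] = gamma^(i-j) for j <= i, else 0 (causal mask with exponential decay)
    D = np.where(n[:, None] >= n[None, :],
                 gamma ** (n[:, None] - n[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def recurrent_retention(Q, K, V, gamma):
    """Step-by-step form used for decoding: fixed-size state, O(1) per token."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)         # fold the new token into the state
        out.append(q @ S)                      # read the state with the current query
    return np.stack(out)

def chunkwise_retention(Q, K, V, gamma, chunk=4):
    """Parallel inside each chunk, recurrent state carried across chunks."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for s in range(0, Q.shape[0], chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        B = q.shape[0]
        i = np.arange(B)
        D = np.where(i[:, None] >= i[None, :],
                     gamma ** (i[:, None] - i[None, :]), 0.0)
        inner = (q @ k.T * D) @ v                      # within-chunk, parallel
        cross = (gamma ** (i + 1))[:, None] * (q @ S)  # contribution of earlier chunks
        out.append(inner + cross)
        S = gamma ** B * S + ((gamma ** (B - 1 - i))[:, None] * k).T @ v
    return np.concatenate(out)

p = parallel_retention(Q, K, V, gamma)
r = recurrent_retention(Q, K, V, gamma)
c = chunkwise_retention(Q, K, V, gamma)
print(np.allclose(p, r), np.allclose(p, c))    # True True: same outputs, different cost profiles
```

The equivalence is the whole point: training can use the parallel path over full sequences, while generation walks the recurrent path with a constant-size state instead of an ever-growing key-value cache.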
The paper shows how parallel retention can be theoretically derived from a recurrent formulation. The key is making the query and key projections content-aware using the input itself, rather than fixed projections.
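In simplified notation (dropping the paper's complex-valued position rotation), the derivation amounts to unrolling a linear recurrence over a state S:

```latex
S_n = \gamma S_{n-1} + K_n^\top V_n, \qquad o_n = Q_n S_n
\;\;\Longrightarrow\;\;
o_n = \sum_{m=1}^{n} \gamma^{\,n-m}\, Q_n K_m^\top\, V_m
```

which is exactly the parallel form (QK^T ⊙ D)V with decay matrix D_{nm} = γ^{n-m} for n ≥ m and 0 otherwise. Because Q_n and K_n are projections of the input X_n, the interactions stay content-dependent even though the state update itself is linear.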
In experiments, RetNets achieved results comparable to Transformers on language modeling benchmarks, while training faster, using 25-50% less training memory, and delivering up to roughly 8X faster inference. RetNets also scale more gracefully, outperforming Transformers once model size exceeds roughly 2 billion parameters.
The retention mechanism essentially gives RetNets the best of both worlds: a parallel, attention-like formulation for training and a recurrent formulation for inference. This could make developing and deploying massive models much more practical.
Microsoft tested RetNets up to 13 billion parameters, claiming they enable “efficient long-sequence modeling” beyond what Transformers can manage. This hints at their potential for tasks like summarization, question answering, and dialogue that depend on retaining long context.
The paper leaves open questions about how far RetNets can be pushed. But given these initial results, the authors argue that RetNets could be to natural language processing what ConvNets were to computer vision. Their properties seem well suited to training ever-larger foundation models.
If RetNets live up to their promise, we may soon see them overtake Transformers as the standard architecture for language AI. The benefits for training massive models on huge datasets could accelerate progress in areas like reasoning and common sense.
Of course, the risks of such powerful models would also increase. But by making them dramatically more efficient, RetNets could help ensure the next generation of language AI develops responsibly.