A new paper from researchers at Google DeepMind proposes Soft Mixture-of-Experts (Soft MoE), a novel sparse Transformer architecture for image classification. The paper shows that Soft MoE significantly outperforms both standard Vision Transformers and popular sparse mixture-of-experts methods across various benchmarks.
Mixture-of-experts models aim to increase model capacity without increasing computational cost by routing different input tokens to different expert modules. However, most prior work relies on discrete routing algorithms (typically hard top-k assignments of tokens to experts) that can suffer from training instability, dropped tokens, and inefficiency.
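To make the contrast concrete, below is a minimal NumPy sketch of hard top-1 routing in the spirit of these discrete schemes. The function and parameter names (`top1_route`, `w_gate`) are illustrative, not taken from any particular implementation, and real routers additionally scale each expert's output by its gating probability and rely on capacity limits and load-balancing losses, all of which are omitted here.

```python
import numpy as np

def top1_route(x, w_gate, experts):
    """Minimal hard (top-1) routing: each token is sent to exactly one expert.

    x:       [m, d] input tokens
    w_gate:  [d, n] router weights (n = number of experts)
    experts: list of n callables, each mapping [k, d] -> [k, d]
    """
    logits = x @ w_gate                 # [m, n] token-expert affinities
    choice = logits.argmax(axis=-1)     # discrete, non-differentiable assignment
    y = np.zeros_like(x)
    for e, expert in enumerate(experts):
        idx = np.where(choice == e)[0]  # tokens assigned to expert e
        if idx.size:
            y[idx] = expert(x[idx])     # capacity limits / balancing losses omitted
    return y
```

The `argmax` is the crux of the problem: the assignment itself carries no gradient, which is why discrete routers need auxiliary losses and careful tuning to train stably.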
The key innovation in Soft MoE is a fully differentiable routing scheme: instead of assigning each token to a single expert, every slot processed by an expert receives a weighted combination of all input tokens, and every output token is in turn a weighted combination of all slot outputs. Because no discrete assignment is ever made, this soft routing avoids the optimization challenges of discrete routing while keeping the sparse-compute benefit that each expert only processes a few slots.
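Here is a minimal NumPy sketch of that dispatch/combine computation for a single sequence with toy expert functions; the names (`soft_moe`, `phi`, `dispatch`, `combine`) are ours, and details from the paper such as batching and input/parameter normalization are left out.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(x, phi, experts, slots_per_expert):
    """Soft MoE forward pass for a single (unbatched) sequence.

    x:                [m, d]      input tokens
    phi:              [d, n * p]  learnable per-slot parameters
    experts:          list of n callables, each mapping [p, d] -> [p, d]
    slots_per_expert: p, the number of slots handled by each expert
    """
    n, p = len(experts), slots_per_expert
    logits = x @ phi                    # [m, n*p] token-slot affinities

    dispatch = softmax(logits, axis=0)  # normalize over tokens: how each slot mixes the tokens
    combine = softmax(logits, axis=1)   # normalize over slots: how each token mixes slot outputs

    slots = dispatch.T @ x              # [n*p, d] every slot is a convex combination of all tokens
    slot_out = np.concatenate(
        [experts[i](slots[i * p:(i + 1) * p]) for i in range(n)]
    )                                   # [n*p, d] each expert processes only its own p slots

    return combine @ slot_out           # [m, d] tokens reassembled from slot outputs
```

Every step is a matrix product or a softmax, so gradients flow to the routing parameters `phi` through every token and every expert, in contrast to the `argmax` in the discrete sketch above.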
Experiments demonstrate Soft MoE’s capabilities:
- Soft MoE models dominate Vision Transformers and other sparse methods on the Pareto frontier of performance versus training cost. For example, Soft MoE-B/16 outperforms ViT-L/16 while requiring 3x fewer FLOPs.
- When matched for training time, Soft MoE-B/16 surpasses ViT-H/14 on upstream metrics while being 5.7x faster at inference, a substantial efficiency gain.
- Soft MoE-L/16 outperforms ViT-H/14 on upstream metrics at 3x lower inference cost, and the largest Soft MoE models substantially improve over all ViT variants.
- The benefits transfer beyond image classification: a frozen Soft MoE image encoder paired with a text tower outperforms its ViT counterparts on image-text retrieval.
These results indicate that the continuous routing in Soft MoE lets sparse Transformers scale beyond what was practical with prior discrete algorithms. It also simplifies training, since the routing is learned end-to-end without auxiliary load-balancing losses.
By unlocking much greater efficiency, Soft MoE could broaden adoption of very large multimodal models. In particular, the ability to serve huge models cheaply after pre-training may prove especially impactful.
While promising, Soft MoE still faces challenges before large-scale deployment. Relying on model parallelism to distribute experts adds engineering complexity, and the approach does not directly apply to the auto-regressive decoding common in NLP, since each slot mixes information from all tokens, including future ones. Nonetheless, Soft MoE represents a major advance that sets a new state of the art for sparse Transformers.