In the YouTube video “Sparse LLMs at inference: 6x faster transformers! | DEJAVU paper explained,” the speaker walks through the Deja Vu paper, which introduces a method for making Large Language Models (LLMs) sparse at inference time while maintaining their performance. Transformer-based LLMs are slow largely because of their expensive self-attention and MLP layers. Sparsity promises efficiency gains, but the challenge is that modern hardware is optimized for dense networks. The authors therefore propose contextual sparsity: depending on the input, parts of the model are switched off, which preserves the model’s in-context learning ability. For OPT-175B, this approach yields substantial runtime savings, making inference about six times faster than the standard transformer implementation.

The authors also observe contextual sparsity in the attention blocks and compare it to mean-shift clustering: each self-attention head performs roughly one mean-shift clustering step, pushing token representations together. This gives rise to “heavy-hitter” attention heads that capture the important interactions for a given input. Heavy hitters are themselves contextually sparse, since which heads matter depends on the input, and the mean-shift analogy explains why: denser regions gain more weight, forming stronger bonds and accumulating higher attention scores.
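To make the core idea concrete, here is a minimal PyTorch sketch of contextual sparsity applied to an MLP block. The `ContextuallySparseMLP` class, its small predictor head, and the fixed `top_k` budget are illustrative assumptions for this summary, not the Deja Vu authors’ actual predictor design or kernels; the point is only that a cheap per-input prediction lets the expensive matrix multiplications run at a fraction of their full width.

```python
import torch
import torch.nn as nn

class ContextuallySparseMLP(nn.Module):
    """Toy MLP block that only evaluates a subset of hidden neurons per input.

    The predictor and the fixed top_k budget are illustrative stand-ins for
    a learned sparsity predictor, not the paper's actual implementation.
    """

    def __init__(self, d_model: int = 64, d_hidden: int = 256, top_k: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Lightweight predictor that guesses which hidden neurons will matter
        # for this particular input (the "contextual" part of the sparsity).
        self.predictor = nn.Linear(d_model, d_hidden)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        # Score every hidden neuron for the current input and keep the top_k.
        scores = self.predictor(x)                     # (batch, d_hidden)
        idx = scores.topk(self.top_k, dim=-1).indices  # (batch, top_k)

        # Gather only the selected rows/columns of the weight matrices, so the
        # expensive matmuls run at width top_k instead of d_hidden.
        w1 = self.fc1.weight[idx]                      # (batch, top_k, d_model)
        b1 = self.fc1.bias[idx]                        # (batch, top_k)
        w2 = self.fc2.weight.t()[idx]                  # (batch, top_k, d_model)

        hidden = torch.relu(torch.einsum("bkd,bd->bk", w1, x) + b1)
        return torch.einsum("bk,bkd->bd", hidden, w2) + self.fc2.bias


# Each row of the batch activates its own 32 of the 256 hidden neurons.
mlp = ContextuallySparseMLP()
out = mlp(torch.randn(4, 64))  # (4, 64)
```

On real hardware, savings like this only materialize with kernels that exploit the sparsity, which is exactly the hardware-friendliness problem the video highlights.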
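The mean-shift comparison can also be made concrete. The sketch below assumes identity Q/K/V projections and a Gaussian kernel purely for illustration (neither comes from the video or the paper); it shows that both updates replace each token with a weighted average of all tokens, with more weight on nearby or similar ones.

```python
import torch

def mean_shift_step(x: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """One mean-shift iteration with a Gaussian kernel over token vectors x: (n, d)."""
    sq_dists = torch.cdist(x, x) ** 2                                # (n, n)
    weights = torch.softmax(-sq_dists / (2 * bandwidth ** 2), dim=-1)
    return weights @ x  # each token moves toward its densest neighbourhood

def attention_step(x: torch.Tensor) -> torch.Tensor:
    """One self-attention step with identity Q/K/V projections, for comparison."""
    weights = torch.softmax(x @ x.t() / x.shape[-1] ** 0.5, dim=-1)  # (n, n)
    return weights @ x  # each token is pulled toward similar tokens

tokens = torch.randn(8, 16)
shifted = mean_shift_step(tokens)
attended = attention_step(tokens)
# In both cases the rows of `weights` sum to 1 and concentrate on close/similar
# tokens, which is the sense in which one attention head performs roughly one
# mean-shift clustering step and denser regions accumulate higher scores.
```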