Researchers from Meta, MIT, and CMU have developed a new framework, “StreamingLLM”, to improve the performance of large language models (LLMs) in long conversations. Rather than discarding the earliest tokens as a conversation grows, the method retains a few initial “attention sink” tokens (tokens the model disproportionately attends to, regardless of their content) in the key-value cache alongside a sliding window of the most recent tokens, preserving response quality even when the conversation exceeds the model’s pre-training sequence length. This allows LLMs to handle text of effectively unbounded length without fine-tuning, which could prove transformative for applications like customer service chatbots.
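At its core this is a cache-eviction policy: instead of dropping the oldest entries outright, the cache always keeps the first few “sink” tokens plus a rolling window of recent ones. The sketch below illustrates that policy in Python; the class and variable names are hypothetical, the 4-token sink count follows the paper’s reported default, and the window size is illustrative.

```python
from collections import deque


class SinkCache:
    """Toy KV-cache eviction policy in the spirit of StreamingLLM:
    always retain the first `num_sinks` tokens (the attention sinks)
    plus a sliding window of the most recent `window` tokens."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.sinks: list = []                        # KV entries for the first tokens
        self.recent: deque = deque(maxlen=window)    # rolling window; evicts oldest

    def append(self, kv_entry) -> None:
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)    # the first tokens become permanent sinks
        else:
            self.recent.append(kv_entry)   # deque drops the oldest entry automatically

    def entries(self) -> list:
        # Attention at each decoding step runs over sinks + recent window only,
        # so memory stays constant no matter how long the stream grows.
        return self.sinks + list(self.recent)
```

One detail this sketch omits: in the paper’s formulation, position information is assigned by a token’s slot within the cache rather than by its original position in the text, which is what lets the rolling window keep working as old tokens are evicted.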