How Attention Sinks Keep Language Models Stable

hanlab.mit.edu

RSS Bot MB to Hacker News · English · 24 days ago

We discovered why language models fail catastrophically in long conversations: when old tokens are evicted to save memory, models produce complete gibberish. We found that models dump massive attention onto the first few tokens, which act as "attention sinks": places to park unused attention, since softmax requires the attention weights to sum to 1. Our solution, StreamingLLM, simply keeps these first 4 tokens permanently while sliding the window over everything else, enabling stable processing of 4 million+ tokens instead of just thousands. This mechanism is now in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's latest models.
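The eviction policy described above (pin the first 4 tokens, slide a window over the rest) can be illustrated with a toy sketch. This is not the StreamingLLM implementation: the names `sink_window_indices` and `SinkWindowCache` and the window size of 8 are hypothetical, and the cache holds plain token ids rather than the key/value tensors a real model would store. It also ignores how positions are assigned inside the cache.

```python
from typing import List


def sink_window_indices(seq_len: int, num_sinks: int = 4, window: int = 8) -> List[int]:
    """Positions that stay cached: the first `num_sinks` tokens plus the
    most recent `window` tokens (hypothetical helper for illustration)."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


class SinkWindowCache:
    """Toy cache: entries are token ids, standing in for key/value tensors."""

    def __init__(self, num_sinks: int = 4, window: int = 8) -> None:
        self.num_sinks = num_sinks
        self.window = window
        self.entries: List[int] = []

    def append(self, token: int) -> None:
        self.entries.append(token)
        if len(self.entries) > self.num_sinks + self.window:
            # Evict the oldest entry that is NOT a sink; the sinks stay pinned.
            del self.entries[self.num_sinks]


cache = SinkWindowCache(num_sinks=4, window=8)
for position in range(20):
    cache.append(position)

print(cache.entries)
```

With 20 tokens streamed in, the cache size stays fixed at 12 entries: the 4 sink positions and the 8 most recent ones. A plain sliding window would have dropped positions 0 to 3, which is the case the post says leads to gibberish.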
