New technique can reduce memory needs of LLMs by 50x
MIT researchers have created "Attention Matching," a clever way to shrink the memory needs of large language models (LLMs) by up to 50 times, almost instantly and without retraining.
Instead of requiring hours of GPU time, the method uses fast, lightweight calculations to keep models running smoothly, preserving accuracy on the evaluated benchmarks or losing only a small amount.
Impressive results on benchmarks
By keeping only the most important entries in the model's key-value (KV) cache, the team compressed the cache for an 8,000-token document from 1GB down to just 20MB.
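The article doesn't spell out how entries are chosen, but the general pattern, ranking cached key-value entries by an importance score and discarding the rest, can be sketched in a few lines of Python. Everything below (the function, the attention-weight scoring idea, the tensor shapes, the 2% retention rate) is an illustrative assumption, not the authors' actual algorithm.

```python
import numpy as np

def compress_kv_cache(keys, values, importance, keep_ratio=0.02):
    """Keep only the highest-importance cache entries (hypothetical sketch).

    keys, values: (seq_len, num_heads, head_dim) cached attention tensors
    importance:   (seq_len,) score per cached token, e.g. attention weight
                  accumulated over recent queries (an assumed criterion)
    keep_ratio:   fraction retained; 0.02 gives roughly 50x compression
    """
    seq_len = keys.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    # Indices of the k highest-scoring tokens, restored to original order
    top = np.sort(np.argpartition(importance, -k)[-k:])
    return keys[top], values[top]

# Toy usage with deliberately small head dimensions:
rng = np.random.default_rng(0)
keys = rng.standard_normal((8000, 8, 64), dtype=np.float32)
values = rng.standard_normal((8000, 8, 64), dtype=np.float32)
importance = rng.random(8000)
small_k, small_v = compress_kv_cache(keys, values, importance)
print(small_k.shape)  # (160, 8, 64) -- 2% of the original 8,000 entries
```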
Even at extreme compression levels, whether shrinking massive prompts or summarizing long texts, the models preserved accuracy on benchmarks like QuALITY and LongHealth, or lost only a little.
Potential to significantly reduce costs
This technique means you can run much bigger batches on less hardware: roughly five times more throughput and lower costs, about $0.50 less per million tokens.
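To see why a smaller cache buys throughput, here is a quick back-of-the-envelope calculation. The GPU capacity and weight footprint below are illustrative assumptions; only the 1GB cache size and the 50x factor come from the figures above.

```python
# Back-of-the-envelope serving math; every number here except the 50x
# factor and the 1GB cache size cited above is an illustrative assumption.
GPU_MEMORY_GB = 80       # assumed accelerator capacity
MODEL_WEIGHTS_GB = 40    # assumed memory reserved for model weights
CACHE_PER_SEQ_GB = 1.0   # cache for one 8,000-token document, as cited
COMPRESSION = 50         # the reported reduction factor

budget_gb = GPU_MEMORY_GB - MODEL_WEIGHTS_GB
before = int(budget_gb / CACHE_PER_SEQ_GB)
after = int(budget_gb / (CACHE_PER_SEQ_GB / COMPRESSION))
print(f"concurrent sequences before: {before}")  # 40
print(f"concurrent sequences after:  {after}")   # 2000
```

Real throughput gains (the roughly 5x cited) are smaller than the raw memory headroom suggests, since compute, not memory, becomes the bottleneck at very large batch sizes.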
For anyone building apps on top of LLMs, this could make inference a lot more efficient and affordable.