Google Announces KV Cache Compression Technology, Storage Demand Likely to Be Affected

Large language models have long faced a scaling bottleneck: as the context window grows, the memory needed to store the key-value (KV) cache grows proportionally, consuming GPU memory and slowing inference. To address this problem, Google has introduced three compression algorithms: TurboQuant, PolarQuant, and Quantized Johnson-Lindenstrauss (QJL). These algorithms aim to compress the cache efficiently without degrading model output quality.
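To see why the KV cache dominates memory at long context, it helps to do the arithmetic. The sketch below is a back-of-the-envelope estimate, not a figure from the article; the model shape (32 layers, 32 KV heads, head dimension 128, fp16 values) is an illustrative assumption roughly matching a 7B-parameter decoder-only model.

```python
# Back-of-the-envelope KV cache sizing for a hypothetical decoder-only model.
# All shape parameters are illustrative assumptions, not figures from the article.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Memory for the K and V tensors cached across all layers (batch size 1)."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Example: a 7B-class shape (32 layers, 32 KV heads, head_dim 128) at fp16.
fp16 = kv_cache_bytes(32, 32, 128, seq_len=128_000, bytes_per_value=2)
print(f"fp16 cache at 128k tokens: {fp16 / 2**30:.1f} GiB")  # -> 62.5 GiB
```

The cache grows linearly with sequence length, so a 128k-token context can consume tens of gigabytes per request before any quantization is applied, which is the footprint these compression schemes target.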

US Stock Storage Sector Declines Across the Board

Google’s new compression technology has sparked market concern about future storage demand. Following the news, flash storage maker SanDisk’s shares fell as much as 9.2% on Wednesday, while Micron’s fell as much as 6.3%.

The new compression technology, TurboQuant, can quantize a large model’s key-value cache down to 3 bits per value, achieving a sixfold reduction in memory and up to an eightfold speedup.

TurboQuant reportedly reduces the cache memory footprint of large models with negligible loss of accuracy. On NVIDIA’s H100 GPU, computing attention logits with 4-bit TurboQuant keys is eight times faster than with unquantized 32-bit keys. PolarQuant, meanwhile, achieves near-lossless retrieval on “needle in a haystack” search tasks.
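The article does not describe TurboQuant’s internals, but the memory savings of any low-bit KV scheme come from the same basic mechanism: storing each cached value as a small integer code plus a per-row scale. The sketch below is a generic round-to-nearest uniform quantizer, shown only to illustrate where the reduction comes from; it is not Google’s algorithm, and all function names and shapes here are illustrative.

```python
import numpy as np

# Generic low-bit uniform quantization of a KV-cache-like tensor.
# NOT Google's TurboQuant -- just the round-to-nearest baseline such
# methods build on, to show where the memory savings come from.

def quantize(x, bits):
    """Per-row asymmetric uniform quantization to `bits` bits."""
    levels = 2**bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / levels                        # step size per row
    q = np.round((x - lo) / scale).astype(np.uint8)   # codes in [0, levels]
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 128)).astype(np.float32)   # 8 cached key vectors

q, scale, lo = quantize(keys, bits=3)
approx = dequantize(q, scale, lo)
err = np.abs(keys - approx).max()
print(f"max abs error at 3 bits: {err:.3f}")
# Packed 3-bit codes vs. 16-bit floats is roughly the sixfold reduction cited.
```

Round-to-nearest alone loses noticeable accuracy at 3 bits; the point of methods like TurboQuant and QJL is to reach such bit widths while keeping attention outputs nearly unchanged.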

Morgan Stanley analysts noted that Google’s compression applies only at inference time and does not reduce hardware requirements outright; rather, it may lower deployment costs and enable a broader range of AI applications.
