
Unlocking LLM superpowers: How PagedAttention solves the memory maze



1. Memory fragmentation

Internal fragmentation

Systems pre-allocate a large chunk of memory for each request, assuming the maximum possible output length (e.g., 2048 tokens). However, if a request only generates a short output, much of that reserved memory goes unused, leading to significant waste.
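
To make the waste concrete, here is a minimal back-of-the-envelope sketch in Python. The per-token KV size (roughly 800 KB, in the ballpark of a 13B-parameter model) and the output lengths are illustrative assumptions, not measurements from any particular system:

```python
# Illustrative numbers only: per-token KV footprint and output lengths are assumptions.
KV_BYTES_PER_TOKEN = 800 * 1024      # assumed KV cache size per token (roughly a 13B model)
MAX_OUTPUT_TOKENS = 2048             # slot reserved up front for every request

actual_output_lengths = [50, 120, 300, 35, 900]   # hypothetical generated lengths

reserved = len(actual_output_lengths) * MAX_OUTPUT_TOKENS * KV_BYTES_PER_TOKEN
used = sum(actual_output_lengths) * KV_BYTES_PER_TOKEN

print(f"reserved:  {reserved / 2**30:.2f} GiB")
print(f"used:      {used / 2**30:.2f} GiB")
print(f"internally fragmented (unused): {(reserved - used) / reserved:.1%}")
```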

External fragmentation

Because different requests reserve chunks of different sizes, GPU memory becomes scattered with small unusable gaps, making it hard to fit new requests even when enough total free memory is available. Our sources show that in existing systems, only 20.4% – 38.2% of KV cache memory is actually used to store token states, with the rest being waste.
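
PagedAttention sidesteps both kinds of fragmentation by splitting the KV cache into fixed-size blocks (a handful of tokens each) and allocating them on demand, much like virtual-memory pages. The following is a minimal sketch of that idea; the class name and block size are illustrative assumptions, not vLLM's actual implementation:

```python
class PagedKVAllocator:
    """Toy block allocator: KV memory is split into equal-size blocks,
    so any free block fits any request and no gap is left unusable."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                   # tokens per block
        self.free_blocks = list(range(num_blocks))     # physical block ids

    def allocate(self, num_tokens: int) -> list[int]:
        """Grab just enough blocks for num_tokens; only the last block may be
        partially filled, so internal waste is at most block_size - 1 tokens."""
        needed = -(-num_tokens // self.block_size)     # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted")
        blocks, self.free_blocks = self.free_blocks[:needed], self.free_blocks[needed:]
        return blocks

    def free(self, blocks: list[int]) -> None:
        """Return blocks to the pool; they can serve any future request."""
        self.free_blocks.extend(blocks)


allocator = PagedKVAllocator(num_blocks=1024)
request_a = allocator.allocate(num_tokens=50)   # 4 small blocks instead of a 2048-token slab
allocator.free(request_a)
```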

2. No memory sharing

Advanced decoding techniques like parallel sampling or beam search often generate multiple outputs from a single prompt, meaning they could share parts of the KV cache. However, existing systems cannot easily share this memory because each sequence's KV cache lives in its own separate, contiguous block.
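
With block-level bookkeeping, sharing becomes natural: sequences sampled from the same prompt can point their block tables at the same physical prompt blocks, and a block is only duplicated when one sequence needs to write to it (copy-on-write). The sketch below is a hypothetical illustration of that bookkeeping, not vLLM's real data structures:

```python
from collections import defaultdict


class SharedBlockTable:
    """Toy copy-on-write block tables for sequences forked from one prompt."""

    def __init__(self):
        self.ref_count = defaultdict(int)   # physical block id -> sequences using it
        self.next_block = 0

    def new_block(self) -> int:
        block = self.next_block
        self.next_block += 1
        self.ref_count[block] = 1
        return block

    def fork(self, parent_table: list[int]) -> list[int]:
        """A new sample reuses the parent's prompt blocks instead of copying the KV cache."""
        for block in parent_table:
            self.ref_count[block] += 1
        return list(parent_table)

    def write(self, table: list[int], index: int) -> None:
        """Copy-on-write: only duplicate a block if another sequence still shares it."""
        block = table[index]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            table[index] = self.new_block()   # private copy for this sequence


manager = SharedBlockTable()
prompt_blocks = [manager.new_block() for _ in range(4)]   # KV blocks of the shared prompt
sample_1 = manager.fork(prompt_blocks)                    # parallel samples share them
sample_2 = manager.fork(prompt_blocks)
manager.write(sample_1, index=3)                          # diverging write triggers a copy
```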
