Conversation

jessegross
Contributor

Models that use sliding window attention can only resume a sequence from the cache if it falls within the saved windows. This works well if the next message picks up where the old one left off. However, it generally prevents a partial prefix match unless the entire conversation falls within the sliding window.

This can be a problem with reasoning models, whose reasoning traces are supposed to be removed from future messages, forcing the entire history to be re-evaluated.

This change allows models to specify that a larger amount of the history be retained in memory, allowing partial resumption in more cases. It still respects the window that the model was trained on for token generation.
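To illustrate the mechanism, here is a minimal sketch in Go of why retaining more history than the trained attention window enables partial resumption. The names here (`resumePoint`, `trainedWindow`, `retainWindow`) are hypothetical and not taken from this PR; the code only models the idea described above, under the assumption that resuming at a position requires that position's full attention window to still be in the cache.

```go
package main

import "fmt"

// commonPrefix returns the number of leading tokens shared by two sequences.
func commonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// resumePoint returns the position from which a new prompt can resume against
// a cached sequence. Resuming at position p requires the cache to still hold
// positions [p-trainedWindow, p); anything older than the last retainWindow
// tokens has been evicted.
func resumePoint(cached, prompt []int, trainedWindow, retainWindow int) int {
	p := commonPrefix(cached, prompt)
	oldestCached := len(cached) - retainWindow // oldest position still in cache
	if oldestCached < 0 {
		oldestCached = 0
	}
	if p-trainedWindow < oldestCached {
		return 0 // attention window not fully cached: recompute from scratch
	}
	return p
}

func main() {
	cached := []int{1, 2, 3, 4, 5, 6, 7, 8}
	prompt := []int{1, 2, 3, 4, 99, 100} // diverges at position 4

	// Retaining only the trained window forces full re-evaluation...
	fmt.Println(resumePoint(cached, prompt, 2, 2)) // 0
	// ...while a larger retention window allows resuming at the divergence.
	fmt.Println(resumePoint(cached, prompt, 2, 8)) // 4
}
```

In the first call, only the trained window is retained, so the attention window at the divergence point has already been evicted and the whole prompt must be recomputed; in the second, the extra retained history lets generation resume at the divergence, while the trained window is still what governs masking during generation.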
@jessegross jessegross merged commit 4183bb0 into main Jul 31, 2025
8 checks passed
@jessegross jessegross deleted the jessegross/swa branch July 31, 2025 21:48