LLM Sleep unlocks multi-hop reasoning in hybrid models without latency cost
Researchers propose periodic offline recurrence passes that compress context into SSM blocks before the attention KV cache is cleared, enabling multi-hop reasoning over evicted context without real-time latency penalties.

LLM Sleep is a training and inference framework that addresses a core weakness of hybrid architectures that combine attention with state-space model (SSM) blocks: they cannot reason deeply over context that has already been evicted from the active attention window. Proposed by researchers Sangyun Lee, Sean McLeish, Tom Goldstein, and Giulia Fanti in a new preprint, the method runs N offline recurrence passes over the active context at regular intervals, consolidating information into structured SSM blocks that act as fast memory, and then clears the attention KV cache. The framework calls the offline phase "sleep": compute-intensive iterative reasoning happens during idle time, decoupled from the strict latency constraints of real-time token generation.
Standard hybrid models like Samba and Jet-Nemotron nominally support long contexts but struggle with multi-hop reasoning once earlier tokens fall outside the attention window. LLM Sleep splits the problem: the deep computation needed for memory consolidation moves to the offline phase, leaving online inference fast and shallow. This separation lets models handle complex queries over evicted context without ballooning generation latency, which matters most for long-document workflows on hybrid architectures.
Offline consolidation and online inference
The framework triggers offline recurrence passes at configurable intervals during inference. Each pass iterates over the current context, updating SSM hidden states to encode relationships and dependencies that would otherwise require deep attention layers to reconstruct. Once consolidation completes, the system discards the attention KV cache, keeping only the compressed SSM state. Future tokens attend to a fresh, short KV window while querying the consolidated SSM memory for earlier context.
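The consolidation cycle described above can be illustrated with a toy sketch. This is not the authors' implementation (the preprint releases no code). Every name here (ToyHybridState, ssm_update, sleep, step, sleep_interval, num_passes) is hypothetical, and the fixed-decay linear recurrence is only a stand-in for learned SSM blocks. The sketch shows just the control flow: tokens accumulate in a KV cache, and once a configurable interval is reached, several offline passes fold the cached context into the SSM state before the cache is cleared.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyHybridState:
    """Toy stand-in for a hybrid model's memory (hypothetical, for illustration)."""
    ssm_state: List[float]                                      # compressed long-term memory
    kv_cache: List[List[float]] = field(default_factory=list)   # active attention window
    tokens_since_sleep: int = 0


def ssm_update(state: List[float], token_vec: List[float], decay: float = 0.9) -> List[float]:
    # Assumed toy recurrence h <- decay * h + (1 - decay) * x.
    # Real SSM blocks use learned, input-dependent dynamics.
    return [decay * h + (1.0 - decay) * x for h, x in zip(state, token_vec)]


def sleep(mem: ToyHybridState, num_passes: int) -> None:
    """Offline phase: run num_passes recurrence passes over the active context,
    then discard the KV cache and keep only the SSM state."""
    for _ in range(num_passes):
        for token_vec in mem.kv_cache:
            mem.ssm_state = ssm_update(mem.ssm_state, token_vec)
    mem.kv_cache.clear()
    mem.tokens_since_sleep = 0


def step(mem: ToyHybridState, token_vec: List[float],
         sleep_interval: int = 4, num_passes: int = 3) -> bool:
    """Online phase: cache the new token; trigger sleep once the interval is reached.
    Returns True if a sleep phase ran on this step."""
    mem.kv_cache.append(token_vec)
    mem.tokens_since_sleep += 1
    if mem.tokens_since_sleep >= sleep_interval:
        sleep(mem, num_passes)
        return True
    return False


# Usage: feed eight toy token vectors; sleep fires after the 4th and 8th.
demo = ToyHybridState(ssm_state=[0.0, 0.0])
for t in range(8):
    step(demo, [1.0, float(t)])
print(len(demo.kv_cache), demo.ssm_state)
```

In this sketch the sleep phase runs inline inside step for simplicity. A deployment following the framework's idea would presumably schedule it during idle time instead, and the online step would also query the SSM state when attending over the short KV window, which the toy omits.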
The offline phase can run during system idle time, such as between user turns in a chat session, during batch pauses, or on background threads, so in many real-world deployments the extra compute need not show up as user-facing latency.
The preprint, posted to arXiv this week, does not release code or trained weights. Practitioners working with existing hybrid models will need to implement the periodic recurrence logic themselves and retrain or fine-tune checkpoints to learn the consolidation behavior. The paper's core contribution is the training recipe and the empirical finding that offline passes unlock multi-hop reasoning quality previously gated behind prohibitively deep online computation.


