Forcing-KV cuts autoregressive video diffusion memory by 30%, hits 29 fps on H200
A new KV cache compression method for autoregressive video diffusion models reduces memory overhead by 30% and delivers up to 2.82× speedup at 1080p, enabling real-time long-horizon video generation on a single GPU.

Forcing-KV is a hybrid KV cache compression technique from researchers at Zhejiang University that tackles the memory bottleneck in autoregressive video diffusion models. The method divides attention heads into two groups: static heads, which focus on transitions between autoregressive chunks and on intra-frame detail, and dynamic heads, which handle inter-frame motion and temporal consistency. Static heads undergo structured pruning, while dynamic heads are pruned based on segment-wise similarity. The result is a 30% reduction in cache memory and generation speeds exceeding 29 frames per second on a single NVIDIA H200 GPU, with no loss in output quality.
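To make the two-policy idea concrete, here is a minimal sketch of head-wise cache pruning in the spirit described above. Everything specific is an assumption: the paper does not publish this code, and the function names, window sizes, segment length, and similarity threshold are illustrative placeholders, not Forcing-KV's actual values.

```python
# Hypothetical sketch of head-wise KV cache pruning: static heads get
# structured pruning, dynamic heads get segment-wise similarity pruning.
# All parameters and helper names are illustrative assumptions.
import numpy as np

def prune_static_head(kv, keep_sink=2, keep_recent=4):
    """Structured pruning: keep the first `keep_sink` tokens (chunk
    transitions) plus the most recent `keep_recent` tokens."""
    if kv.shape[0] <= keep_sink + keep_recent:
        return kv
    return np.concatenate([kv[:keep_sink], kv[-keep_recent:]], axis=0)

def prune_dynamic_head(kv, segment=4, sim_threshold=0.98):
    """Similarity-based pruning: split the cache into segments and drop
    a segment whose mean vector is nearly identical (cosine similarity)
    to the previously kept segment's mean."""
    segments = [kv[i:i + segment] for i in range(0, kv.shape[0], segment)]
    kept = [segments[0]]
    prev_mean = segments[0].mean(axis=0)
    for seg in segments[1:]:
        mean = seg.mean(axis=0)
        cos = mean @ prev_mean / (
            np.linalg.norm(mean) * np.linalg.norm(prev_mean) + 1e-8)
        if cos < sim_threshold:  # keep only segments that add new content
            kept.append(seg)
            prev_mean = mean
    return np.concatenate(kept, axis=0)

def compress_cache(cache, static_heads):
    """Apply the per-head policy; `cache` is [num_heads, seq_len, dim].
    Pruned caches have differing lengths, so a list is returned."""
    return [
        prune_static_head(cache[h]) if h in static_heads
        else prune_dynamic_head(cache[h])
        for h in range(cache.shape[0])
    ]
```

Because the two policies keep different numbers of tokens per head, a real implementation would need ragged (per-head) cache storage rather than one dense tensor; that is the structural change that buys the memory reduction.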
The team tested Forcing-KV on LongLive and Self Forcing, two recent autoregressive video diffusion architectures that adopt a streaming generation framework for long-horizon synthesis. At 480p resolution, the method delivers 1.35× and 1.50× speedups respectively. At 1080p, the speedup reaches 2.82×, a meaningful gain for practitioners generating high-resolution video in real time on a single GPU. The paper notes that attention patterns and head-wise functional roles remain stable across samples and denoising steps, which allows the pruning strategy to be applied consistently without per-sample tuning.
Autoregressive video diffusion has gained traction for its ability to generate arbitrarily long sequences with real-time responsiveness, but the accumulation of key-value caches from historical frames has been a scalability ceiling. Forcing-KV's head-wise functional specialization offers a structured path around that ceiling, though the paper does not yet report results on models beyond LongLive and Self Forcing. Code and demo videos are available on the project page, and the preprint includes ablation studies on head-wise pruning ratios.
The next test will be whether the technique generalizes to other autoregressive architectures and whether the 30% memory saving holds at even longer sequence lengths. If it does, Forcing-KV could become a standard optimization pass for any AR video diffusion pipeline aiming to scale beyond a few hundred frames.