Causal Forcing++ cuts video generation to 2 steps, halves first-frame latency
New distillation method from Tsinghua and Shengshu AI achieves frame-wise autoregressive video generation in 1-2 sampling steps, beating prior 4-step chunk-wise methods on quality while cutting first-frame latency by 50 percent.

Causal Forcing++ is a diffusion distillation pipeline from Tsinghua and Shengshu AI that pushes autoregressive video generation down to 1-2 sampling steps per frame. The method, detailed in a preprint released this week, targets real-time interactive video: the kind of streaming, low-latency rollout needed for world models and interactive content. Where existing autoregressive diffusion distillation methods denoise chunks of frames in 4 steps, Causal Forcing++ generates frame by frame and cuts first-frame latency in half.
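To make the frame-wise, few-step rollout concrete, here is a minimal PyTorch sketch of what such a sampling loop could look like. The `generator` callable, its signature, and the fixed two-point timestep schedule are illustrative assumptions, not the paper's actual interface; the point is that each new frame starts from fresh noise and needs only one or two denoiser calls, conditioned causally on frames already generated.

```python
import torch

@torch.no_grad()
def framewise_rollout(generator, context, num_frames, num_steps=2,
                      timesteps=(1.0, 0.5)):
    """Frame-wise few-step autoregressive rollout (illustrative sketch).

    `generator` stands in for a distilled few-step denoiser that maps a
    noisy latent frame plus causal frame context to a cleaner estimate;
    `context` is a (batch, time, channels, height, width) latent tensor.
    """
    frames = []
    for _ in range(num_frames):
        x = torch.randn_like(context[:, -1])        # fresh noise per frame
        for t in timesteps[:num_steps]:
            # each call conditions only on already-generated (causal) frames
            x = generator(x, t=t, context=context)
        frames.append(x)                            # frame is ready to stream
        context = torch.cat([context, x.unsqueeze(1)], dim=1)
    return torch.stack(frames, dim=1)
```

Because the first frame is emitted after just `num_steps` denoiser calls on a single frame, rather than after a full multi-step pass over a whole chunk, this structure is what drives the first-frame latency reduction.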
The team (Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, and Xinyuan Li) identified initialization as the bottleneck in such aggressive few-step regimes: existing strategies either misalign with the target distribution, break down under few-step generation, or fail to scale. Causal Forcing++ addresses this with causal consistency distillation (causal CD), which learns the same autoregressive-conditional flow map as causal ODE distillation but draws its supervision from a single online teacher ODE step between adjacent timesteps. That design avoids precomputing and storing full PF-ODE trajectories, making the initialization both cheaper and easier to optimize.
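The paper's exact objective isn't reproduced in the announcement, but a rough sketch under standard flow-matching and consistency-distillation conventions helps make the "single online teacher ODE step" idea concrete. Everything below (function names, signatures, the Euler discretization, the fixed step size `dt`) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def causal_cd_loss(student, ema_student, teacher, x0, context, dt=0.05):
    """One causal consistency-distillation update (illustrative sketch).

    Supervision comes from a single online teacher ODE step between
    adjacent timesteps, so no PF-ODE trajectories are precomputed or
    stored; all names and signatures here are hypothetical.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device) * (1.0 - dt) + dt  # t in [dt, 1]
    s = t - dt                                             # adjacent earlier step
    noise = torch.randn_like(x0)
    tb = t.view(-1, 1, 1, 1)
    x_t = (1.0 - tb) * x0 + tb * noise                     # flow-matching interpolant

    with torch.no_grad():
        # one Euler step of the teacher's probability-flow ODE, t -> s,
        # conditioned causally on previously generated frames
        v = teacher(x_t, t=t, context=context)
        x_s = x_t + (s - t).view(-1, 1, 1, 1) * v
        # consistency target from an EMA copy of the student at time s
        target = ema_student(x_s, t=s, context=context)

    pred = student(x_t, t=t, context=context)              # student output at t
    return F.mse_loss(pred, target)
```

The online teacher step is the key cost saver: instead of integrating and caching an entire teacher trajectory per training example, each update spends one extra teacher forward pass and nothing more.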
Performance and efficiency gains
- Quality. Under a frame-wise 2-step setting, Causal Forcing++ beats the state-of-the-art 4-step chunk-wise baseline by 0.1 on VBench Total, 0.3 on VBench Quality, and 0.335 on VisionReward.
- Latency. First-frame latency drops by 50 percent, consistent with the first frame needing only two denoising steps instead of four.
- Training cost. Stage 2 training cost falls by roughly a factor of 4 thanks to the more efficient initialization.
- World models. The authors extend the pipeline to action-conditioned world model generation following the Genie3 approach; a minimal sketch of how action conditioning could slot into the frame-wise rollout follows below.
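The announcement gives no implementation detail for the world-model extension, but the simplest plausible shape, assuming the distilled generator just accepts an action embedding alongside the causal frame context, would look like the rollout above with one extra input. Names and signatures are again hypothetical.

```python
import torch

@torch.no_grad()
def action_conditioned_step(generator, context, action_embed, num_steps=2,
                            timesteps=(1.0, 0.5)):
    """Generate the next frame given a user action (illustrative sketch)."""
    x = torch.randn_like(context[:, -1])    # fresh noise for the next frame
    for t in timesteps[:num_steps]:
        # the action embedding is fed in alongside the causal frame context
        x = generator(x, t=t, context=context, action=action_embed)
    return x  # caller appends this to `context` before the next action arrives
```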