AnyFlow improves video diffusion quality as sampling steps increase
AnyFlow, a new distillation framework from MIT and Meta researchers, optimizes the full ODE trajectory instead of fixed endpoints, enabling video diffusion models to improve with more sampling steps rather than degrade.
AnyFlow targets a fundamental limitation in few-step video generation. While consistency distillation has accelerated sampling, models trained that way typically perform worse when users allocate more steps at inference, the opposite of what practitioners expect from diffusion models. AnyFlow fixes this by distilling the full probability-flow ODE trajectory rather than collapsing it into a fixed endpoint mapping.
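The difference between the two regimes comes down to which time pairs get supervised. The sketch below contrasts them; the function names and the uniform time sampling are illustrative assumptions, not the paper's implementation.

```python
import random

def consistency_pair(t_max=1.0):
    """Endpoint consistency: every supervised jump ends at time 0,
    so the model only ever learns the noise-to-clean shortcut."""
    t = random.uniform(0.0, t_max)
    return (t, 0.0)

def flow_map_pair(t_max=1.0):
    """Flow-map transition learning: the jump may end at any earlier
    time r, so the model learns transitions along the whole trajectory."""
    t = random.uniform(0.0, t_max)
    r = random.uniform(0.0, t)
    return (t, r)
```

The second sampler covers the (t, 0) endpoint pairs as a special case, which is why a flow-map student can still do one-jump generation while also supporting intermediate stops.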
The core shift is from endpoint consistency (mapping noise directly to clean output in one jump) to flow-map transition learning across arbitrary time intervals. Instead of training a model to leap from timestep t to timestep 0 in a single bound, AnyFlow learns to navigate from t to any intermediate time r, preserving the original ODE's test-time scaling behavior. The team introduces Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling on-policy distillation that cuts discretization error in few-step sampling and exposure bias in autoregressive video generation. Ablations span bidirectional and causal architectures at 1.3B, 7B, and 14B parameters.
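One way to picture the Euler-rollout decomposition is to integrate a toy probability-flow ODE on a fine grid and treat every multi-step segment as a one-jump target for the flow map. The sketch below does this for dx/dt = -x; the function name and the all-pairs segment scheme are assumptions for illustration, not the paper's exact algorithm.

```python
def backward_simulation_targets(velocity, x_start, times):
    """Euler-integrate dx/dt = velocity(x, t) along `times` (decreasing),
    then emit every segment (x_i, t_i, t_j, x_j) with i < j as a shortcut
    target that a flow map F should reproduce in a single jump."""
    states = [x_start]
    x = x_start
    for t, r in zip(times[:-1], times[1:]):
        x = x + (r - t) * velocity(x, t)  # one Euler step from t to r
        states.append(x)
    return [(states[i], times[i], times[j], states[j])
            for i in range(len(times)) for j in range(i + 1, len(times))]

# Toy PF-ODE: dx/dt = -x, rolled out from t = 1 back to t = 0.
targets = backward_simulation_targets(lambda x, t: -x,
                                      1.0, [1.0, 0.75, 0.5, 0.25, 0.0])
```

Each tuple reads: starting from state x_i at time t_i, a single flow-map evaluation should land on x_j at time t_j. Supervising all segments, rather than only the jump to t = 0, is what preserves the trajectory structure that endpoint consistency discards.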
In the few-step regime—typically 4 to 8 steps—AnyFlow matches or beats consistency-distilled baselines. The decisive advantage appears when users push beyond 8 steps: AnyFlow continues to improve, while consistency models plateau or regress. The authors report gains across VBench metrics at 14B scale, with the gap widening as step budgets increase. The framework works with both bidirectional diffusion transformers (for single-shot generation) and causal models (for long-form video), suggesting it generalizes across architectural families.
The next question is whether AnyFlow's on-policy training overhead—backward simulation adds compute during distillation—pays off in production settings where inference cost dominates. The paper does not publish wall-clock distillation times or memory footprints, so practitioners will need to benchmark those when weights drop. If the training cost is manageable, AnyFlow could become the default distillation recipe for any video model where users want the option to trade more steps for higher quality at test time.
