FashionChameleon swaps garments in real-time video at 23.8 FPS
New framework brings interactive garment customization to autoregressive video generation: it trains on single-garment data, supports multi-garment switching at inference, and runs 30–180× faster than existing baselines.

Researchers have developed a system that lets users swap garments interactively during video generation while maintaining motion coherence, and it does so at 23.8 frames per second on a single GPU.
FashionChameleon, detailed in a preprint released this week, addresses a longstanding bottleneck in human-centric video customization. Existing approaches require full re-generation to change a garment, making them impractical for e-commerce try-on flows or real-time content creation. The framework trains on single-garment video pairs but extends to multi-garment scenarios at inference time through three core techniques.
The first is a Teacher Model with In-Context Learning, trained on reference-garment pairs in which the reference image and the garment image are intentionally mismatched. This forces the model to preserve motion coherence even when the garment changes mid-sequence, without requiring multi-garment training data. The second technique, Streaming Distillation with In-Context Learning, fine-tunes the teacher model using in-context teacher forcing and a gradient-reweighted distribution matching loss to improve consistency during long-form extrapolation. The third, Training-Free KV Cache Rescheduling, handles interactive garment switching with three cache operations: refreshing garment-related key-value cache entries, withdrawing outdated historical cache, and disentangling the reference-image cache. None of these steps requires retraining.
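To make the third technique more concrete, here is a toy sketch in Python of what a cache reschedule on a garment switch could look like. It is not the authors' implementation. The `CacheEntry` class, the `make_entry` and `reschedule` functions, the `keep_recent` window, and the string payloads standing in for key/value tensors are all hypothetical. The sketch also assumes that "outdated" history means frames older than a short recent window, and that "disentangling" the reference-image cache means keeping it separate so it can be reused unchanged. The paper may define both differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    role: str        # "reference", "garment", or "history" (hypothetical labels)
    frame: int       # frame that produced the entry; -1 for conditioning inputs
    garment_id: str  # garment the entry was computed under; "" for the reference
    payload: str     # stand-in for the real key/value tensors


def make_entry(role: str, frame: int, garment_id: str) -> CacheEntry:
    """Pretend to encode an input into a key/value cache entry."""
    return CacheEntry(role, frame, garment_id, f"kv({role},{garment_id},{frame})")


def reschedule(cache: List[CacheEntry], new_garment: str,
               current_frame: int, keep_recent: int = 2) -> List[CacheEntry]:
    """Rebuild the cache for a garment switch without touching model weights."""
    # Reference-image entries are held apart from garment state and reused as-is.
    reference = [e for e in cache if e.role == "reference"]
    # Old garment entries are discarded and replaced by entries for the new garment.
    garment = [make_entry("garment", -1, new_garment)]
    # Assumption: only the most recent frames are kept to carry motion forward.
    recent = [e for e in cache
              if e.role == "history" and e.frame >= current_frame - keep_recent]
    return reference + garment + recent


# Five frames generated with one garment, then a switch before frame 5.
cache = [make_entry("reference", -1, ""), make_entry("garment", -1, "red_jacket")]
cache += [make_entry("history", t, "red_jacket") for t in range(5)]
cache = reschedule(cache, "blue_coat", current_frame=5)
print([(e.role, e.frame, e.garment_id) for e in cache])
```

In this toy version the switch costs one re-encoding of the garment image plus some list filtering, which illustrates why a training-free cache update can be much cheaper than regenerating the whole video.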
According to the paper, the system runs 30 to 180 times faster than existing baselines while supporting interactive customization and consistent long-video generation. Authors Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, and Xiaoyong Zhu detail the approach on arXiv.