iTryOn learns hand-garment contact from sparse video interactions
Researchers introduce iTryOn, a framework for Interactive Video Virtual Try-On that uses 3D hand priors and timestamped action captions to generate realistic garment swaps in which subjects actively touch and adjust their clothing.

iTryOn is a video diffusion transformer that generates virtual try-on sequences in which people actively interact with garments (pulling sleeves, adjusting collars, tugging hems) rather than simply standing in static poses. Described in a paper posted May 21, 2026, by Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, and Xiaoyong Zhu, the framework formalizes Interactive Video Virtual Try-On as a distinct task and proposes a multi-level injection mechanism that guides generation through moments of contact that are sparse and brief.
The mechanism works at two levels. At the spatial level, iTryOn uses a garment-agnostic 3D hand prior to predict precise hand-garment contact points, sidestepping the semantic ambiguity that standard pose skeletons leave unresolved. At the semantic level, the framework takes in global video captions for overall context and timestamped action captions ("pulls sleeve at 2.3s") for localized interactions. The two are synchronized by a novel Action-aware Rotational Position Embedding (A-RoPE), which aligns text tokens with video frames.
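The idea of aligning a timestamped action caption with video frames can be illustrated with a small sketch. This is not the paper's A-RoPE formulation, which the summary above does not spell out. It assumes a standard rotary position embedding and simply gives a caption token the temporal position of the frame it refers to. The helper names (`timestamp_to_frame`, `rope_rotate`, `dot`), the frame rate, and the toy vectors are all hypothetical.

```python
import math


def timestamp_to_frame(t_seconds, fps, num_frames):
    """Map a caption timestamp to the nearest frame index, clamped to the clip.

    Assumption: frames are evenly spaced at `fps` and start at t = 0.
    """
    idx = round(t_seconds * fps)
    return max(0, min(num_frames - 1, idx))


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary embedding to `vec` at an integer position.

    Consecutive pairs (vec[i], vec[i + 1]) are rotated by an angle that
    grows with `position` and shrinks with the pair index.
    """
    dim = len(vec)
    assert dim % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical clip: 40 frames at 8 fps, caption "pulls sleeve at 2.3s".
fps, num_frames = 8, 40
action_frame = timestamp_to_frame(2.3, fps, num_frames)

# Toy query (action-caption token) and key (video-frame token).
q_text = [0.5, -1.0, 0.25, 2.0]
k_frame = [1.0, 0.5, -0.5, 1.5]

# Giving the text token the same temporal position as its frame means the
# relative rotation between them is zero, so the logit is unchanged.
aligned_logit = dot(rope_rotate(q_text, action_frame),
                    rope_rotate(k_frame, action_frame))

# For any other frame, the logit depends only on the temporal offset.
offset_logit = dot(rope_rotate(q_text, action_frame),
                   rope_rotate(k_frame, action_frame + 2))
```

Under these assumptions, a caption token and the frame it describes share a position, so their attention score is not altered by the rotation, while scores against other frames vary with how far those frames are from the action. How iTryOn actually assigns positions to action tokens may differ.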
The architecture builds on a large-scale video diffusion transformer backbone, trained to learn complex garment deformations from video datasets in which interactive moments are rare and fleeting. According to the authors' experiments, iTryOn achieves state-of-the-art results on traditional non-interactive video virtual try-on (VVT) benchmarks and takes a clear lead in the new interactive setting, where existing methods fail to preserve temporal consistency during contact.