Alibaba's MACE-Dance generates synchronized dance video from music using dual experts
MACE-Dance, accepted to SIGGRAPH 2026, splits music-to-dance generation across two experts: a BiMamba-Transformer motion expert for 3D pose synthesis and a Wan-Animate appearance expert for video rendering.
Alibaba's MACE-Dance, accepted to SIGGRAPH 2026, generates full dance video from music by splitting the task across two specialized experts. The system takes an audio track and outputs synchronized dance footage, decoupling motion synthesis from visual rendering. Code and model weights are available on GitHub and HuggingFace.
Motion synthesis
The Motion Expert combines BiMamba and Transformer layers to produce 3D skeletal animation from the audio input. BiMamba models local temporal dependencies in the audio-motion mapping, while the Transformer layers capture cross-modal context between audio features and pose sequences. The result is a frame-by-frame 3D pose sequence locked to the beat and phrasing of the track.
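A minimal PyTorch sketch of what one such motion-expert layer could look like, assuming the mamba_ssm package for the SSM blocks; the module names, dimensions, the residual fusion of the two scan directions, and the 6D-rotation pose head are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency for the SSM blocks

class BiMambaBlock(nn.Module):
    """Bidirectional Mamba (illustrative): scan the sequence forward and
    reversed, then fuse the two directions with a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.fwd = Mamba(d_model=dim)
        self.bwd = Mamba(d_model=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) pose tokens
        forward_scan = self.fwd(x)
        backward_scan = self.bwd(torch.flip(x, dims=[1])).flip(dims=[1])
        return self.norm(x + forward_scan + backward_scan)

class MotionExpertLayer(nn.Module):
    """One motion-expert layer (illustrative): BiMamba models local temporal
    structure, cross-attention to audio features supplies cross-modal context."""
    def __init__(self, dim: int = 512, heads: int = 8, joints: int = 24):
        super().__init__()
        self.bimamba = BiMambaBlock(dim)
        self.cross = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.pose_head = nn.Linear(dim, joints * 6)  # assumed 6D rotation per joint

    def forward(self, pose_tokens: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # pose_tokens, audio_feats: (batch, frames, dim)
        h = self.bimamba(pose_tokens)              # local temporal dependencies
        h = self.cross(tgt=h, memory=audio_feats)  # audio-conditioned cross-modal context
        return self.pose_head(h)                   # per-frame 3D pose parameters
```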
Appearance synthesis
The Appearance Expert builds on Wan-Animate, an open-weight video synthesis model, to render the final frames. It takes the 3D poses from the Motion Expert and a reference image of the dancer, then generates video that preserves visual identity and spatiotemporal coherence across the sequence. The mixture-of-experts design keeps motion generation and appearance synthesis independent, allowing each expert to specialize without interference.
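At a high level, the decoupling means the two experts compose as a simple two-stage pipeline. The sketch below treats both experts as black boxes; the function and argument names are placeholders, not the project's actual interface.

```python
import torch

def generate_dance_video(audio_feats: torch.Tensor,
                         reference_image: torch.Tensor,
                         motion_expert,
                         appearance_expert) -> torch.Tensor:
    """Two-stage, decoupled generation (illustrative names, not the released API):
    the motion expert maps audio to a 3D pose sequence, and the appearance expert
    renders those poses in the identity of the reference image."""
    with torch.no_grad():
        pose_sequence = motion_expert(audio_feats)                  # (frames, joints, 3)
        frames = appearance_expert(pose_sequence, reference_image)  # (frames, 3, H, W)
    return frames
```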
Both experts' pretrained weights are hosted on HuggingFace, with inference scripts available on GitHub. Full benchmark comparisons and dataset details have not yet been disclosed.
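If the checkpoints follow the standard Hub layout, fetching them could look like the snippet below; the repository id is a placeholder, not the actual one.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual MACE-Dance repository on HuggingFace.
checkpoint_dir = snapshot_download(repo_id="<org>/<mace-dance-weights>")
print(checkpoint_dir)  # local path containing the downloaded weight files
```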
