Happy Horse 1.1 adds 1080p and nine-reference i2v, but users report minimal quality gains
Alibaba's Happy Horse video synthesis model reached version 1.1 this week with 1080p output, nine-reference image-to-video, and improved lip-sync, though early users report only incremental gains over the prior release.
Practitioners working with Alibaba's Happy Horse video synthesis model described version 1.1, released this week, as incremental rather than transformative. The new version accepts up to nine reference images at once for image-to-video generation, outputs 1080p video (up from 720p in version 1.0), and includes a revised lip-sync module designed to reduce temporal drift when characters speak.
Happy Horse 1.1 is available through fal.ai's hosted API at fal.ai/models/alibaba/happy-horse/v1.1/text-to-video. Alibaba has not published weights or a technical paper for the model. It runs exclusively as a hosted service, with no local inference option and no stated plans for open-weight release. Pricing and context-length limits are set by the fal.ai platform; Alibaba's model card does not specify training data, parameter count, or architecture details beyond the nine-reference input design.
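For readers who want to try the hosted endpoint, the sketch below shows one way a request might be assembled before it is handed to an API client. It is illustrative only: the endpoint path is taken from the URL above, but the argument names (`prompt`, `resolution`, `reference_image_urls`) and the helper `build_arguments` are assumptions, since no request schema for the model is quoted here, and it is not stated whether this particular endpoint accepts reference images. The sketch only builds and validates a dictionary and sends nothing over the network.

```python
# Endpoint path copied from the fal.ai URL quoted in the article.
ENDPOINT_ID = "alibaba/happy-horse/v1.1/text-to-video"

# Limit reported for version 1.1 image-to-video generation.
MAX_REFERENCE_IMAGES = 9


def build_arguments(prompt, reference_image_urls=()):
    """Build a request-arguments dict for the hosted model.

    All field names are hypothetical; check the platform's published
    schema before sending a real request.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")

    refs = list(reference_image_urls)
    if len(refs) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"at most {MAX_REFERENCE_IMAGES} reference images, got {len(refs)}"
        )

    # "1080p" mirrors the output resolution reported for version 1.1.
    arguments = {"prompt": prompt, "resolution": "1080p"}
    if refs:
        arguments["reference_image_urls"] = refs
    return arguments


if __name__ == "__main__":
    example = build_arguments(
        "a horse galloping along a beach at sunset",
        ["https://example.com/horse-front.png"],
    )
    print(ENDPOINT_ID, example)
```

Checking the nine-image limit on the client side means an oversized request fails immediately instead of after a round trip to the hosted service.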
The muted reception among practitioners reflects a broader pattern in video synthesis: resolution bumps and multi-reference input are table stakes, but perceptual quality (motion coherence, temporal consistency, artifact suppression) remains the harder problem. Users who have run the new version report that output quality tracks closely with 1.0: the resolution increase yields sharper frames but no visible improvement in motion realism or prompt adherence.




