The workflow starts with Qwen3.5-35B-A3B planning six shots and character bibles from the input sentence, then uses FLUX.2 klein to paint canonical character portraits with no LoRA training: reference editing pins identity across frames by construction. FLUX.2 renders per-shot keyframes in under a second after warmup. Wan2.2-I2V-A14B animates each keyframe at its native 81 frames and 16 fps, the distribution the model was trained on; an earlier attempt at 121 frames and 24 fps produced temporal rippling. FLF2V (first-and-last-frame-to-video) anchors the last frame of one shot to the first frame of the next for seamless transitions.
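A minimal sketch of the animation step, assuming the Hugging Face diffusers integration of Wan2.2 (`WanImageToVideoPipeline` and the `Wan-AI/Wan2.2-I2V-A14B-Diffusers` checkpoint are the published diffusers entry points; the keyframe path, prompt, and step count are illustrative, not the developer's exact settings):

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the Wan2.2 image-to-video MoE pipeline (A14B = 14B active params).
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

keyframe = load_image("shots/shot_03_keyframe.png")  # FLUX.2 klein output

# 81 frames at 16 fps is the distribution Wan2.2 was trained on; the
# article's 121-frame / 24 fps attempt produced temporal rippling.
frames = pipe(
    image=keyframe,
    prompt="Tracking shot following from behind. A courier cycles through rain.",
    num_frames=81,
    num_inference_steps=40,
).frames[0]

export_to_video(frames, "shots/shot_03.mp4", fps=16)
```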
A vision critic stage reloads the same Qwen3.5-35B with ten structured failure labels (character drift, extras invading the frame, camera ignored, walking backwards, object morphing, hand artifacts, wardrobe drift, neon glow leak, stylized AI look, random intimacy) and re-renders bad clips with targeted retry strategies: a different seed, an FLF2V anchor, or prompt simplification. ACE-Step v1 generates a 30-second instrumental from the director agent's brief. Kokoro-82M handles narration in nine languages, with the director picking the language to match the setting: Tokyo gets Japanese, Paris French, Mumbai Hindi. The final mix uses ffmpeg, with each shot's voice-over aligned via the adelay filter.
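One plausible way to wire the critic's labels to the three retry strategies is a dispatch table. This sketch is mine, not the developer's code; the label-to-strategy assignments are assumptions about which failures are identity problems versus sampling luck:

```python
from enum import Enum

# The ten structured failure labels from the critic stage.
class Failure(str, Enum):
    CHARACTER_DRIFT = "character_drift"
    EXTRAS_INVADING_FRAME = "extras_invading_frame"
    CAMERA_IGNORED = "camera_ignored"
    WALKING_BACKWARDS = "walking_backwards"
    OBJECT_MORPHING = "object_morphing"
    HAND_ARTIFACTS = "hand_artifacts"
    WARDROBE_DRIFT = "wardrobe_drift"
    NEON_GLOW_LEAK = "neon_glow_leak"
    STYLIZED_AI_LOOK = "stylized_ai_look"
    RANDOM_INTIMACY = "random_intimacy"

# Hypothetical mapping: identity failures get an FLF2V anchor,
# instruction failures get a simpler prompt, the rest reroll the seed.
RETRY_STRATEGY = {
    Failure.CHARACTER_DRIFT: "flf2v_anchor",
    Failure.WARDROBE_DRIFT: "flf2v_anchor",
    Failure.CAMERA_IGNORED: "simplify_prompt",
    Failure.STYLIZED_AI_LOOK: "simplify_prompt",
}

def retry_plan(label: Failure) -> str:
    return RETRY_STRATEGY.get(label, "new_seed")

assert retry_plan(Failure.HAND_ARTIFACTS) == "new_seed"
```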
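The adelay alignment itself is a one-filtergraph job: delay each narration clip to its shot's start time (adelay takes milliseconds, one value per channel), then amix everything over the music bed. A sketch with illustrative offsets and file names:

```python
import subprocess

# (voice-over file, shot start offset in ms) -- values are illustrative.
shots = [("vo_shot1.wav", 0), ("vo_shot2.wav", 5100), ("vo_shot3.wav", 10200)]

inputs, filters, tags = ["-i", "music.wav"], [], ["[0:a]"]
for i, (path, offset_ms) in enumerate(shots, start=1):
    inputs += ["-i", path]
    # adelay shifts the clip to its shot's start; one value per channel.
    filters.append(f"[{i}:a]adelay={offset_ms}|{offset_ms}[vo{i}]")
    tags.append(f"[vo{i}]")

# Mix the delayed voice-overs over the music bed; duration=first keeps
# the output as long as the music track.
filtergraph = ";".join(
    filters + [f"{''.join(tags)}amix=inputs={len(tags)}:duration=first[mix]"]
)

subprocess.run(
    ["ffmpeg", "-y", *inputs, "-filter_complex", filtergraph,
     "-map", "[mix]", "final_audio.wav"],
    check=True,
)
```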
The developer rendered at 1280×720 instead of Wan's 640×640 default, accepting the higher cost to match production expectations. Flow shift was set to 5 for hero shots and 8 for b-roll. The negative prompt uses the verbatim Chinese tokens from Wan's training configuration: umT5's multilingual pretraining anchored those exact tokens, and English translations perform observably worse. Camera instructions use one verb per shot, in sentence case, placed first: "Tracking shot following from behind."
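In diffusers terms those choices land in a handful of settings. A sketch that continues from the pipeline loaded earlier (`pipe`, `keyframe`); the prompt is illustrative, and the negative prompt shown is a truncated stand-in for Wan's stock Chinese string, which ships with the model repo:

```python
from diffusers import UniPCMultistepScheduler

# flow_shift steers the flow-matching noise schedule:
# the article uses 5 for hero shots and 8 for b-roll.
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=5.0  # hero shot; 8.0 for b-roll
)

# First few tokens of Wan's stock Chinese negative prompt, truncated here;
# use the full verbatim string from the model repo in practice.
negative = "色调艳丽，过曝，静态，细节模糊不清"

frames = pipe(
    image=keyframe,
    # Camera instruction: one verb, sentence case, placed first.
    prompt="Tracking shot following from behind. The courier cycles on.",
    negative_prompt=negative,
    width=1280,   # above Wan's default, accepted for the production look
    height=720,
    num_frames=81,
).frames[0]
```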
Every model carries an Apache 2.0 or MIT license.