Qwen vision + Z Image workflow automates Japanese film-style prompts in ComfyUI
A user-shared ComfyUI workflow pairs Qwen's vision model with Z Image generation to produce Japanese film-style imagery, with local and cloud deployment options.
A ComfyUI workflow combining Qwen's image captioning with Z Image generation surfaced this week, designed to produce Japanese film-style visuals through a two-stage prompt pipeline. The workflow uses Qwen's multimodal vision model to analyze reference images and generate detailed text prompts, then feeds those prompts into Z Image for final rendering. The creator describes the output quality as notably high and has tested the same prompts in MyJet, reporting visible differences in style and fidelity between the two generators.
The workflow automates the captioning-to-generation loop, removing the manual step of writing prompts from scratch. Users upload a reference image, Qwen extracts compositional and stylistic details, and Z Image interprets those details through its training on Japanese cinema aesthetics. The creator notes that swapping Qwen3 for a larger language model such as Gemini or GPT-4 improves prompt specificity and output consistency, though the base workflow runs entirely locally with Qwen3.
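The loop described above can be sketched in a few lines of Python. This is a hypothetical illustration of the two-stage pattern, not the creator's workflow code: the function names, the fixed caption, and the style directive are all placeholders standing in for the Qwen captioning and Z Image rendering stages.

```python
def caption_image(image_path: str) -> str:
    """Stage 1 stand-in: a vision model (Qwen3 in the workflow) would
    describe the reference image's composition, lighting, and style.
    Here we return a fixed caption for illustration."""
    return ("35mm film still, soft natural light, muted greens, "
            "shallow depth of field")


def build_generation_prompt(caption: str) -> str:
    """Wrap the extracted details in a style directive for the generator.
    The wording is a placeholder, not the workflow's actual template."""
    return f"Japanese cinema aesthetic, {caption}, subtle film grain"


def generate_image(prompt: str) -> dict:
    """Stage 2 stand-in: Z Image would render the prompt. We return a
    payload resembling what a generation node might receive."""
    return {"prompt": prompt, "width": 1024, "height": 576}


# The automated loop: reference image in, rendered job out, no
# hand-written prompt in between.
job = generate_image(build_generation_prompt(caption_image("reference.jpg")))
print(job["prompt"])
```

The point of the sketch is the data flow: each stage's output is the next stage's input, so swapping the captioning model (as the creator suggests with Gemini or GPT-4) changes only stage 1 without touching the generator.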
Three deployment options are available: a local ComfyUI JSON workflow file, a cloud-based version on the RunningHub ComfyUI platform, and a Midjourney adaptation via TapNow. The creator says they'll open-source the underlying agent code if community response justifies the effort.
