ComfyUI workflow chains Qwen3VL captioning with ZIT for character LoRA transforms

A ComfyUI workflow converts any input image into a character-specific output using ZIT, Qwen3VL captioning, and a character LoRA, running in under 60 seconds on a 12GB RTX 4070 Super.

May 12, 2026

A ComfyUI workflow automates character transformation by chaining Qwen3VL captioning with ZIT image-to-image generation and LoRA-based refinement. The workflow takes any input image, generates a text prompt via Qwen3VL, applies a character LoRA at multiple resolutions, and outputs a stylized version in under a minute on an RTX 4070 Super with 12GB VRAM. The workflow JSON is available on Pastebin.

The pipeline runs in three stages. First, the input is downscaled to 768 pixels on the long edge and fed to Qwen3VL, which generates a base prompt for the initial ZIT image-to-image pass; a denoise value between 0.45 and 0.55 works best at this stage. Second, the latent is upscaled 2× and the character LoRA is reapplied; this two-pass approach produces cleaner results than single-pass text-to-image generation. Third, SAM3 detects the face region, which is refined via ComfyUI's Inpaint Crop node with the LoRA applied again, followed by a light sharpening pass. A group bypasser node lets users toggle the upscale and face-fix stages independently; the final image is saved only after stage three completes.
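The resolution arithmetic behind the first two stages can be sketched in a few lines of Python. This is an illustration only, not code taken from the workflow JSON: the function names are hypothetical, and snapping dimensions to a multiple of 8 and leaving already-small images unscaled are assumptions based on common latent-diffusion practice.

```python
def fit_long_edge(width, height, long_edge=768, multiple=8):
    """Scale (width, height) so the longer side is at most long_edge.

    Aspect ratio is preserved. Snapping to `multiple` is an assumption,
    not something the workflow description specifies.
    """
    # Only downscale; images already within the limit keep their size (assumption).
    scale = min(1.0, long_edge / max(width, height))

    def snap(value):
        # Round to the nearest multiple, never dropping below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


def stage_sizes(width, height, upscale_factor=2):
    """Return the stage-one size and the stage-two (upscaled) size."""
    base_w, base_h = fit_long_edge(width, height)
    return (base_w, base_h), (base_w * upscale_factor, base_h * upscale_factor)


if __name__ == "__main__":
    base, upscaled = stage_sizes(1920, 1080)
    print("stage one:", base)
    print("stage two:", upscaled)
```

Under these assumptions, a 1920×1080 input is processed at 768×432 in stage one and at 1536×864 after the 2× upscale.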

The workflow is built for ZIT, a fast local diffusion model, but can be adapted to any checkpoint by swapping the VAE and CLIP loaders. A text concatenate node lets users prepend the LoRA trigger word and additional prompt fragments to Qwen3VL's auto-generated caption. A recent update removed the WAS Node Pack dependency and switched to ZIT's native VAE and CLIP, simplifying setup for new users.
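The effect of the text concatenate step can be illustrated with a small Python sketch. It is not the node's implementation: the function name, the comma separator, and the example trigger word are all hypothetical, chosen only to show the ordering described above (trigger word, then extra fragments, then the generated caption).

```python
def build_prompt(trigger, caption, extras=(), sep=", "):
    """Join the LoRA trigger word, optional fragments, and the caption.

    Empty or whitespace-only parts are skipped so the result has no
    dangling separators. The separator is an assumption.
    """
    parts = [trigger, *extras, caption]
    return sep.join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    # "mychar" stands in for a real LoRA trigger word.
    print(build_prompt("mychar", "a woman standing in a park", ["red jacket"]))
```

Placing the trigger word first keeps it in the same position regardless of how long the generated caption turns out to be.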

The workflow demonstrates how third-party ComfyUI nodes can wire together multimodal captioning, iterative refinement, and region-based inpainting without custom code. Whether similar workflows will ship as native ComfyUI features or remain community-maintained JSON depends on how quickly the core team formalizes multi-stage pipelines and face-detection hooks in the next few releases.