Wan 2.2 In ComfyUI: What You're Actually Building
A ComfyUI workflow for Wan 2.2 takes a text prompt or input image, runs it through the Wan diffusion transformer via the ComfyUI-WanVideo custom-node package, and outputs an mp4 clip. The A14B variants need 48 GB of VRAM at fp8 quantization; the TI2V 5B variant runs on 24 GB. That's the whole shape of the build — the rest of this guide is plumbing.
Wan 2.2 is the open-weights video model from Alibaba's Wan-AI team, released in mid-2025. It comes in three checkpoints with different trade-offs between VRAM, speed, and capability. The reference workflows ship from the Wan-AI team and from kijai's ComfyUI-WanVideoWrapper, the de facto custom-node package the community settled on.
Why Local Wan 2.2 Still Matters In 2026
In October 2025, ArtificialAnalysis confirmed on X what the Wan team had been telegraphing for weeks: Wan 2.5 ships closed-weights. Alibaba pivoted the 2.5 line to a SaaS/API product. No checkpoint download, no local inference, prompt logging by default, content moderation enforced at the API layer. The 2.5 release is the moment Wan stopped being open.
That makes Wan 2.2 the open ceiling for local video generation through 2026. Tencent's HunyuanVideo is the only competing 80GB-tier open video model and it's heavier to run; LTX Video covers the 16 GB tier but trades quality for the lower memory footprint. Between them, Wan 2.2 sits in the prosumer 24-48 GB sweet spot — the band that matches a 4090, a 5090, or an A6000 workstation.
Local Wan 2.2 has no content filter. No usage cap. No prompt logging. No telemetry to a vendor that can change the rules in a quarterly product review. You download the safetensors file once and the model never gets worse, never adds a new "safety" layer, never decides one Tuesday morning that your past prompts violate a new policy. For users who left closed AI specifically because of moderation, that's the entire pitch.
If you were waiting for Wan 2.5 to fix Wan 2.2's rough edges, stop waiting. The version of Wan you can actually own is the one already on Hugging Face.
The Three Wan 2.2 Variants — Which One You Run
T2V A14B — text-to-video, 14B parameters, 48 GB VRAM at fp8. Roughly 4-8 minutes for a 5-second 720p clip on an A6000. Pure prompt-to-video, no image conditioning. Best for fully synthetic shots where you want the model to invent the entire scene from text.
I2V A14B — image-to-video, 14B parameters, 48 GB VRAM at fp8. You feed it a still — typically the output of FLUX, Pony, or Illustrious — and the model animates it. This is the most-used Wan variant in 2026 because the dominant prosumer pipeline is "perfect-still in an image model, then animate via Wan I2V." You get the precise composition and aesthetic control of an image model and the temporal coherence of a video model. T2V can't match that level of art direction.
TI2V 5B — hybrid 5B model handling both text-to-video and image-to-video from a single checkpoint, 24 GB VRAM at fp8, 2-4 minutes per clip. Quality is meaningfully lower than the A14B variants — softer details, more motion artifacts on complex scenes — but TI2V is the only Wan checkpoint that fits on a single 4090 without offloading. If you're on consumer hardware, this is your entry point.
Prerequisites
- ComfyUI installed (portable Windows build, or `git clone https://github.com/comfyanonymous/ComfyUI` on Linux/Mac)
- ComfyUI-Manager installed — strongly recommended for managing custom nodes and dependencies
- 24-48 GB VRAM — RTX 4090, RTX 5090, A6000, dual 3090s with NVLink, or equivalent
- 100 GB free disk space for model weights, VAE, text encoder, and ComfyUI cache
- Python 3.10+ with a CUDA-matched PyTorch build (CUDA 12.4 or 12.6 are the typical targets in 2026)
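Before installing anything Wan-specific, it's worth confirming that the PyTorch build matches the CUDA toolkit you expect and can actually see the GPU. A quick sanity check from the environment ComfyUI runs in:

```bash
# print the PyTorch version, the CUDA version it was built against, and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

# confirm the card and its total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```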
Step 1: Install The ComfyUI-WanVideo Custom Node
The custom-node package everyone uses is kijai/ComfyUI-WanVideoWrapper. Install it via the Manager or clone manually.
Via ComfyUI-Manager: open Manager → Install Custom Nodes → search "WanVideo" → install ComfyUI-WanVideoWrapper by kijai. Restart ComfyUI when prompted.
Manual install:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
cd ComfyUI-WanVideoWrapper
pip install -r requirements.txt
```
Restart ComfyUI. The new node categories — WanVideo Model Loader, WanVideo Sampler, WanVideo TextEncode, and friends — will show up in the right-click node menu.
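To confirm the wrapper actually registered its nodes without hunting through the menu, you can ask the running ComfyUI server for its node registry. A minimal check, assuming ComfyUI is listening on the default port 8188 and `jq` is installed:

```bash
# list every node class the server exposes and keep only the WanVideo wrapper's entries
curl -s http://127.0.0.1:8188/object_info | jq -r 'keys[] | select(test("WanVideo"))'
```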

Step 2: Download The Wan 2.2 Weights
Pull from the official Wan-AI organization on Hugging Face: Wan-AI/Wan2.2-I2V-A14B, Wan-AI/Wan2.2-T2V-A14B, and Wan-AI/Wan2.2-TI2V-5B. The recommended quantization is fp8 — specifically the _fp8_e4m3fn files. For Wan 2.2 I2V A14B that's Wan2.2-I2V-A14B_fp8_e4m3fn.safetensors, roughly 13-15 GB on disk. fp8 is the sweet spot: visually indistinguishable from fp16 in side-by-side tests, but half the VRAM and faster sampling.
Three things matter for placement. The diffusion model goes in ComfyUI/models/diffusion_models/ — not checkpoints/. The file contains only the diffusion transformer; Wan ships the VAE and text encoder as separate files rather than bundling everything into the single all-in-one checkpoint that checkpoints/ expects. The VAE goes in ComfyUI/models/vae/. The UMT5-XXL text encoder goes in ComfyUI/models/text_encoders/. For the I2V variants you also need CLIP-Vision-H in ComfyUI/models/clip_vision/.
Sample directory layout for an I2V setup:
```text
ComfyUI/models/
├── diffusion_models/
│   └── Wan2.2-I2V-A14B_fp8_e4m3fn.safetensors
├── vae/
│   └── Wan2.1_VAE.safetensors
├── text_encoders/
│   └── umt5_xxl_fp8_e4m3fn_scaled.safetensors
└── clip_vision/
    └── clip_vision_h.safetensors
```
Note the VAE is named Wan2.1_VAE.safetensors — Wan 2.2 reuses the 2.1 VAE intentionally. Don't try to swap in an SDXL or FLUX VAE; you'll get black frames.
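If you prefer the command line to the Hugging Face web UI, `huggingface-cli` can pull files straight into the folders above. A sketch for the I2V weights; the exact file names and repo layout vary between the official repos and community fp8 repacks, so check the model card and adjust the `--include` pattern before running it:

```bash
pip install -U "huggingface_hub[cli]"

# fetch only the fp8 diffusion weights into ComfyUI's diffusion_models folder
# (the filename pattern below matches the layout shown above; your repo may differ)
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B \
  --include "*fp8_e4m3fn*.safetensors" \
  --local-dir ComfyUI/models/diffusion_models/
```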

Step 3: Load The Reference Workflow
The Wan-AI team and the WanVideoWrapper repo both publish reference workflow JSONs in custom_nodes/ComfyUI-WanVideoWrapper/example_workflows/. Drag the JSON file directly onto the ComfyUI canvas and the entire graph reconstructs — model loader, text encoder, sampler, decoder, video combine, all wired up.
Three reference workflows to know by name:
- `wanvideo_T2V_workflow.json` — pure text-to-video with the A14B checkpoint
- `wanvideo_I2V_workflow.json` — image-to-video with A14B
- `wanvideo_TI2V_workflow.json` — hybrid 5B for either mode
Each workflow loads the right model, text encoder, VAE, and sampler chain for its variant. Start with the reference graph; don't try to wire WanVideo nodes from scratch on first attempt.
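Exact JSON filenames drift between wrapper releases; if the three above don't match what you have, list the folder that ships with the node pack and load whichever variant is present:

```bash
ls custom_nodes/ComfyUI-WanVideoWrapper/example_workflows/
```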
Step 4: The I2V Workflow Walkthrough
Walking through the I2V graph node by node, because it's the workflow most people actually run:
- WanVideo Model Loader — points at the I2V A14B fp8 safetensors file and loads it into VRAM. Set precision to fp8_e4m3fn here.
- WanVideo TextEncode — UMT5-XXL encodes positive and negative prompts into the text-conditioning embedding the diffusion transformer expects. Use one TextEncode node per prompt direction.
- Load Image — your input still. Output of an image model, a photograph, or a previous video frame all work.
- WanVideo I2V Image Encode — encodes the still and extracts CLIP-Vision features. This is where the model gets the spatial conditioning that anchors the animation to your input image.
- WanVideo Sampler — the diffusion step itself. Steps, CFG, sampler, scheduler, frame count, seed all live here. This is the node you'll spend the most time tuning.
- WanVideo Decode — the VAE decode that turns the latent video tensor into pixel frames.
- Video Combine — frames-to-mp4 encoding via ffmpeg. Set fps and codec here.
For source stills, the standard prosumer pipeline is to render the keyframe in an image model first and then animate it.
Recommended Sampler Settings
A working starting point for I2V A14B:
```text
Sampler: euler
Scheduler: simple or beta (beta gives slightly cleaner motion)
Steps: 25-30 (more than 30 stops helping)
CFG: 6.0 (range 5.5-7 OK; below 5 loses prompt adherence, above 7 burns)
Frame count: 81 (5-second clip at 16 fps)
Resolution: 720x1280 portrait or 1280x720 landscape (16:9 / 9:16)
Seed: random for exploration, fixed for reproducibility
```
Full settings comparison across the three variants:
| Parameter | T2V A14B | I2V A14B | TI2V 5B |
|---|---|---|---|
| Min VRAM (fp8) | 38 GB | 40 GB | 22 GB |
| Recommended steps | 25-30 | 25-30 | 20-25 |
| CFG | 6.0 | 6.0 | 5.5 |
| Frame count | 81 (5s @ 16fps) | 81 | 81 |
| Default resolution | 1280x720 | matches input | 768x1280 |
| Sampler | euler / unipc | euler / unipc | euler |
The "Min VRAM" column assumes fp8 quantization with sage attention enabled. Without sage attention, add roughly 4-6 GB to each row.
Common Errors And Fixes
- "CUDA out of memory" — drop to fp8 if you're on fp16, reduce frame count from 81 to 49 or 33, enable sage attention or xformers. If you're already at fp8 with 49 frames and still OOM-ing, you're on the wrong variant — switch to TI2V 5B.
- "NoneType has no attribute" in WanVideo Sampler — the text encoder didn't load. Check the WanVideo TextEncode node has UMT5-XXL selected, not a stale CLIP reference from a SDXL workflow.
- Black frames in output — wrong VAE. Wan needs `Wan2.1_VAE.safetensors`. SDXL VAE, FLUX VAE, or any other VAE produces black or noise frames. Re-download the VAE from `Wan-AI/Wan2.2-I2V-A14B` and place it in `ComfyUI/models/vae/`.
- Choppy or flickering motion — too few steps, or aggressive quantization. Q4 GGUF degrades temporal coherence in ways you'll see immediately in the output. Stay on fp8. If you're already at fp8, push steps from 20 to 30.
- "Module not found" on first run — custom-node dependencies didn't install. Activate the venv ComfyUI uses and run
pip install -r requirements.txtincustom_nodes/ComfyUI-WanVideoWrapper. On portable Windows builds, use the embedded Python:python_embeded\python.exe -m pip install -r ....
Performance Tuning
sageattention — install via pip install sageattention and the WanVideoWrapper will pick it up automatically. Replaces standard scaled-dot-product attention with a kernel-fused version. Roughly 30-40% memory reduction on the attention path and a modest speed win. This is the single highest-impact optimization for Wan workflows.
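A minimal install-and-verify pass, run from the same environment ComfyUI uses (portable Windows builds should substitute the embedded `python_embeded\python.exe` as shown earlier):

```bash
pip install sageattention
# if this import fails, the wrapper can't use it and you lose the memory savings
python -c "import sageattention; print('sageattention import OK')"
```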
xformers — alternative attention optimization. Slightly less effective than sageattention but more compatible with older PyTorch builds and pre-Ada GPUs. Good fallback if sageattention won't compile on your stack.
fp8 vs Q4 GGUF — fp8 is the recommended quantization. Q4 GGUF saves disk space and lets the model fit on smaller cards, but it visibly degrades temporal coherence — frames lose internal consistency, motion becomes lurchy. fp8 is the floor; don't go lower.
TeaCache — timestep-based caching custom node that skips redundant computation between similar timesteps. Roughly 2x speedup at a small quality cost. Worth enabling for prototyping passes when you're hunting for the right prompt; disable for final renders.
Frame interpolation (RIFE / FILM) — post-process Wan's 16fps output to 32fps or 48fps for smoother playback. Standard RIFE and FILM nodes exist in the ComfyUI ecosystem and slot in after Video Combine. Cheaper than rendering more frames natively.
Workflow Patterns That Work
Iterate at low resolution, render at high. Generate dozens of variations at 768x768 with 25 steps to find the prompt and seed that work. Then re-render the keepers at full 1280x720 with 30 steps. Wan sampling time scales hard with both resolution and step count; treat the low-res pass as the search and the high-res pass as the commit.
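One way to automate that search pass is to drive ComfyUI headlessly over its HTTP API: export the graph with "Save (API Format)", then queue it repeatedly with different seeds. A rough sketch, assuming `jq` is installed, the export is saved as `wan_i2v_api.json`, and the sampler node's ID in your export is `"37"` (check your own JSON; both the filename and node ID here are placeholders):

```bash
# queue eight seed variations of an API-format workflow against a local ComfyUI server
for seed in $(shuf -i 1-2147483647 -n 8); do
  jq --argjson s "$seed" '.["37"].inputs.seed = $s' wan_i2v_api.json \
    | jq '{prompt: .}' \
    | curl -s -X POST http://127.0.0.1:8188/prompt \
        -H "Content-Type: application/json" -d @-
done
```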
I2V chain for longer clips. Wan 2.2 caps comfortably at around 5 seconds per generation. For longer outputs, generate the first 5-second clip, take the last frame, feed it back as the I2V input for the next 5 seconds, and repeat. Stitch the segments together for ~20-30 seconds of consistent motion. Quality drift accumulates — by clip four or five the subject starts shifting — but for short-form output it works.
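A sketch of the chaining mechanics with ffmpeg: grab the final frame of each finished segment to seed the next I2V generation, then concatenate the segments once they are all rendered with matching settings (the filenames here are placeholders):

```bash
# pull the last frame of the previous clip to use as the next I2V input image
ffmpeg -sseof -0.1 -i clip_01.mp4 -frames:v 1 -update 1 last_frame_01.png

# once every segment exists, stitch them in order without re-encoding
printf "file '%s'\n" clip_0*.mp4 > segments.txt
ffmpeg -f concat -safe 0 -i segments.txt -c copy stitched.mp4
```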
Hybrid pipeline. The standard 2026 prosumer stack is: image model (Pony, Illustrious, or FLUX) for the keyframe → Wan I2V for the animation → RIFE for frame interpolation → ffmpeg encoding to web-friendly mp4. Each stage does what it's best at. Don't try to make Wan do composition work an image model handles better, and don't try to make an image model do motion.
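The final stage of that stack is just an encode. A typical command, assuming the interpolation step wrote numbered PNGs into a `frames/` directory at 32 fps (adjust paths and frame rate to your pipeline):

```bash
# H.264 in yuv420p with faststart plays in essentially every browser and player
ffmpeg -framerate 32 -i frames/%05d.png \
  -c:v libx264 -pix_fmt yuv420p -crf 18 -movflags +faststart out.mp4
```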
Why Not Use Wan 2.5 / Veo / Sora?
Wan 2.5 is closed-weights, API-only, with prompt logging and content moderation at the API layer. Sora 2 and Veo 3 — same architecture of access. Closed APIs, gated keys, mandatory moderation, terms-of-service that can change without notice. Kling, Runway, Pika round out the pack and they all moderate, all log, all close.
The trade is honest. Closed APIs produce slightly cleaner output today — call it a 12-month lead on cinematic polish, sharper detail, more reliable motion. Local Wan 2.2 is uncensored, untracked, and untouchable by future policy changes. For users who left closed AI specifically because of moderation, the choice is already made. For the cinematic-polish crowd that doesn't care about moderation, the closed APIs are fine. This guide is not for them.
The other consideration: open-weights models don't get worse. The Wan 2.2 checkpoint on your disk in 2026 is the same checkpoint in 2027 and 2030. Closed APIs degrade in subtle ways — moderation tightens, prompts that worked last quarter get rejected this quarter, the model gets quietly swapped for a cheaper one. Local inference is the only path where the model you tested is the model you ship.
Alternatives if Wan 2.2 doesn't fit your hardware:
- Tencent's HunyuanVideo, the 80 GB-tier open alternative. Comparable openness, but heavier to run than Wan 2.2.
- LTX Video, which covers the 16 GB tier. It trades quality for the lower memory footprint but fits mid-range consumer cards.