Qwen-Image-VAE-2.0 hits 64× compression with skip-connection architecture
Alibaba's Qwen team released a technical report on Qwen-Image-VAE-2.0, a high-compression variational autoencoder trained on billions of images that achieves state-of-the-art reconstruction fidelity at compression ratios up to 64×.

Qwen-Image-VAE-2.0, a suite of high-compression variational autoencoders from Alibaba's Qwen team, preserves image detail even at extreme compression ratios—up to 64×. The models introduce Global Skip Connections (GSC) and expanded latent channels to route fine-grained information directly from encoder to decoder, breaking the traditional trade-off between compression and reconstruction quality.
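The skip-connection idea can be illustrated with a toy encoder/decoder. This is a conceptual sketch only, not Qwen's actual architecture: the bottleneck here is a single mean value, and the skip path carries the fine-grained residual straight to the decoder, which is the role Global Skip Connections play at scale.

```python
# Toy sketch of a skip connection: the decoder sees both the compressed
# latent and features routed directly from the encoder, so fine detail
# that the bottleneck discards can still reach the output.
# All names and shapes are illustrative, not from the Qwen report.

def encode(x):
    # Lossy bottleneck: keep only a coarse summary (the "latent"),
    # plus a fine-grained residual that a skip path can carry forward.
    mean = sum(x) / len(x)
    residual = [v - mean for v in x]  # detail the bottleneck loses
    return mean, residual

def decode(mean, skip=None):
    n = 4  # reconstruction size, fixed for the toy
    base = [mean] * n  # what the latent alone recovers: a flat, "blurry" output
    if skip is None:
        return base
    return [b + s for b, s in zip(base, skip)]  # skip restores the detail

x = [1.0, 2.0, 3.0, 6.0]
mean, residual = encode(x)
print(decode(mean))            # without skip: [3.0, 3.0, 3.0, 3.0]
print(decode(mean, residual))  # with skip:    [1.0, 2.0, 3.0, 6.0]
```

Without the skip path the decoder can only reproduce the coarse summary; with it, the reconstruction is exact. The real models route multi-scale convolutional features rather than raw residuals, but the information-flow argument is the same.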
Variational autoencoders form the compression layer underneath most modern image diffusion models. A VAE encodes a high-resolution image into a compact latent representation, the diffusion model operates on that compressed space, and the VAE decoder reconstructs the final output. Higher compression means faster training and inference, but historically it has come at the cost of blurry reconstructions or lost detail. Qwen's architectural changes aim to solve that directly.

Training scaled to billions of images and incorporated a synthetic rendering engine specifically to handle text-heavy documents, a domain where prior VAEs struggled. To make the high-dimensional latent space work well with diffusion models, the team implemented an enhanced semantic alignment strategy. The encoder uses an asymmetric, attention-free backbone to keep encoding costs low, a practical consideration for practitioners running inference on consumer hardware.
Downstream diffusion transformer (DiT) experiments converge significantly faster than with existing high-compression baselines, a property the paper calls superior "diffusability." The team also introduced OmniDoc-TokenBench, a new benchmark for evaluating VAE performance on real-world documents with OCR-based metrics. Text rendering has been a persistent weak spot for high-compression VAEs; blurred or garbled characters are a common artifact. Qwen-Image-VAE-2.0 achieves state-of-the-art scores on both general reconstruction benchmarks and the new text-rich test set. The arXiv preprint (2605.13565) was published May 14, 2026.