NVIDIA Nemotron-Labs-Diffusion hits 3× speedup over Qwen Eagle with hybrid generation modes

NVIDIA released six Nemotron-Labs-Diffusion models (3B, 8B, and 14B language models plus an 8B vision-language variant) that switch between autoregressive, diffusion, and speculative generation modes. In speculative mode, the 8B model reaches roughly 3× the throughput of Qwen3-8B-Eagle3.

May 18, 2026

NVIDIA released Nemotron-Labs-Diffusion this week, a collection of six hybrid models that combine autoregressive and diffusion generation in a single architecture. The lineup includes 3B, 8B, and 14B language models, each in base and aligned versions, plus an 8B vision-language model.

The models operate in three distinct modes: standard autoregressive generation for quality-focused tasks, diffusion-based generation for diversity, and a speculative mode that drafts output via diffusion, then self-corrects with autoregressive sampling. In speculative mode, the 8B model achieves roughly 3× the throughput of Qwen3-8B-Eagle3 while matching or slightly exceeding Qwen3-8B on quality benchmarks. NVIDIA acknowledges that Qwen 3.5 has since shipped, shifting the competitive baseline.
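To make the draft-then-correct idea concrete, here is a minimal sketch of a generic greedy speculative decoding loop. It is not NVIDIA's implementation: the function names (`speculative_step`, `generate`), the toy drafter and verifier, and the block size are all assumptions made for illustration. The drafter stands in for the diffusion pass that proposes several tokens at once, and the verifier stands in for the autoregressive pass that checks them.

```python
from typing import Callable, List, Tuple

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]  # hypothetical: proposes a block of tokens
NextFn = Callable[[List[Token]], Token]              # hypothetical: greedy next token


def speculative_step(prefix: List[Token], draft_block: DraftFn,
                     ar_next: NextFn, block_size: int) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the verifier agrees with,
    and replace the first disagreement with the verifier's own token."""
    draft = draft_block(prefix, block_size)
    out: List[Token] = []
    accepted = 0
    for tok in draft:
        # A real system would score the whole draft in one batched forward
        # pass; the per-token loop here is only for readability.
        target = ar_next(prefix + out)
        if tok == target:
            out.append(tok)
            accepted += 1
        else:
            out.append(target)  # correction; the rest of the draft is discarded
            break
    return out, accepted


def generate(prompt: List[Token], draft_block: DraftFn, ar_next: NextFn,
             max_new_tokens: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Repeat draft-then-verify steps until enough tokens are produced."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        new_tokens, _ = speculative_step(tokens, draft_block, ar_next, block_size)
        tokens.extend(new_tokens)
        steps += 1
    return tokens[:len(prompt) + max_new_tokens], steps


# Toy stand-ins (assumptions, not real models): the verifier counts upward
# modulo 10, and the drafter makes the same prediction except that it
# deliberately gets the token 5 wrong.
def toy_ar_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: List[Token], k: int) -> List[Token]:
    block: List[Token] = []
    last = seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(0 if last == 5 else last)
    return block


if __name__ == "__main__":
    output, steps = generate([0], toy_draft_block, toy_ar_next, max_new_tokens=12)
    print(output, steps)
```

Because every drafted token is checked against the verifier, the output of this greedy sketch is identical to what plain autoregressive decoding would produce. Any speedup comes from accepting several drafted tokens per verification step, so it depends on how often the drafter and verifier agree.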

The vision-language variant extends the hybrid approach to multimodal tasks, though the release materials emphasize text generation performance. All six models are available on Hugging Face under permissive licenses. The technical report details the training recipe and speculative decoding mechanics but does not publish full benchmark tables against the newest Qwen or Llama releases.

The hybrid architecture accepts added model complexity in exchange for inference flexibility: practitioners can switch between modes depending on whether they prioritize speed, sample diversity, or a balance of both. Adoption will hinge on how its speed-quality tradeoff compares with pure autoregressive models already optimized for speculative decoding.
