NVIDIA quantizes Wan2.2 text-to-video to FP8 for Blackwell inference
NVIDIA released an FP8-quantized version of Wan2.2-T2V-A14B optimized for the Blackwell architecture and integrated with TensorRT-LLM, calibrated on OpenVid-1M.
NVIDIA released an FP8-quantized version of Wan2.2-T2V-A14B this week, optimized for Blackwell GPUs and integrated with TensorRT-LLM. The text-to-video model, whose A14B suffix denotes 14 billion active parameters, uses MXFP8 block-scaling quantization and was calibrated on the OpenVid-1M dataset. The weights are available in Diffusers format.
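Since the weights ship in Diffusers format, loading should follow the standard Diffusers pattern. The snippet below is a minimal sketch, not an official quickstart: the repo id is a placeholder (the exact Hugging Face path isn't given here), and the dtype, resolution, and frame count are illustrative assumptions.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Placeholder repo id -- substitute the actual NVIDIA FP8 checkpoint path.
pipe = DiffusionPipeline.from_pretrained(
    "nvidia/Wan2.2-T2V-A14B-FP8",
    torch_dtype=torch.bfloat16,  # assumed dtype for unquantized components
)
pipe.to("cuda")

# Illustrative generation settings; actual supported values may differ.
video = pipe(
    prompt="a red fox running through fresh snow, cinematic lighting",
    num_frames=81,
    height=480,
    width=832,
).frames[0]

export_to_video(video, "fox.mp4", fps=16)
```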
The FP8 precision cuts memory footprint roughly in half compared to FP16 while preserving most of the original model's output quality. NVIDIA's MXFP8 implementation applies block-level scaling factors to maintain numerical stability across the transformer blocks. The Blackwell-specific optimizations target the architecture's tensor cores, which handle FP8 matrix operations natively. TensorRT-LLM integration means the model can run inference through NVIDIA's optimized runtime, bypassing some of the overhead in generic PyTorch pipelines.
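To make the block-scaling idea concrete, here is a simplified, self-contained sketch of MXFP8-style quantization in PyTorch: each contiguous block of 32 values shares a single power-of-two scale, following the OCP MX convention, and the scaled values are cast to FP8 E4M3. This mimics the format, not NVIDIA's actual kernels, and assumes PyTorch 2.1+ for float8 support.

```python
import torch

BLOCK = 32  # the MX formats share one scale per 32-element block
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def mxfp8_quantize(x: torch.Tensor):
    """Quantize a 1-D tensor to FP8 with per-block power-of-two scales."""
    assert x.numel() % BLOCK == 0
    blocks = x.float().reshape(-1, BLOCK)
    # The largest magnitude in each block sets that block's scale.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # Round the scale up to a power of two (an E8M0-style exponent-only
    # scale) so the block maximum stays representable in E4M3.
    scale = torch.exp2(torch.ceil(torch.log2(amax / FP8_MAX)))
    q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).reshape(-1)

x = torch.randn(4096)
q, s = mxfp8_quantize(x)
err = (mxfp8_dequantize(q, s) - x).abs().max()
print(f"max abs reconstruction error: {err:.5f}")
```

Storing one 8-bit scale per 32 FP8 values keeps the overhead of the scaling factors to roughly 3 percent of the weight memory while letting each block use the full E4M3 dynamic range.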
Early testers report generation speeds of around 1.8 seconds per frame on a B200 GPU, though those numbers are unofficial and depend heavily on resolution and frame count. It remains unclear whether NVIDIA will backport the quantization recipe to Hopper or older architectures, and whether the FP8 checkpoint will eventually replace the FP16 original as the default download. A full benchmark suite comparing output quality frame by frame against the original weights would show how much fidelity the quantization actually costs in practice, which is the real test for practitioners deciding whether to adopt the compressed version.
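Until such a suite exists, a rough per-frame fidelity check is easy to script. The sketch below computes frame-wise PSNR between videos generated from the same prompt and seed by the FP16 and FP8 checkpoints; the array shapes and variable names are illustrative assumptions, and PSNR is only a crude proxy for perceptual quality.

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two uint8 frames."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0**2 / mse)

def per_frame_psnr(fp16_video: np.ndarray, fp8_video: np.ndarray) -> list[float]:
    """Compare two videos shaped (frames, height, width, channels)."""
    assert fp16_video.shape == fp8_video.shape
    return [psnr(a, b) for a, b in zip(fp16_video, fp8_video)]
```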
