Nvidia NVFP4 quantization keeps Kimi-K2.6 accuracy intact across six benchmarks
Nvidia released NVFP4-quantized versions of Moonshot AI's Kimi-K2.6 and Kimi-K2.5 language models, matching or exceeding baseline INT4 accuracy while enabling commercial use.

Nvidia released NVFP4-quantized versions of Moonshot AI's Kimi-K2.6 and Kimi-K2.5 language models this week on Hugging Face. Kimi-K2.6-NVFP4 is a quantized checkpoint of Moonshot AI's autoregressive transformer model, compressed using Nvidia's Model Optimizer toolchain. Both checkpoints are cleared for commercial and non-commercial use.
The NVFP4 quantization matches or improves on baseline INT4 accuracy across six benchmarks. On SciCode, NVFP4 scores 54.4 versus 52.6 for native INT4—a 1.8-point gain. MMMU Pro climbs to 76.5 from 75.6, and AA-LCR rises to 71.8 from 71.0. GPQA Diamond dips slightly to 90.4 from 90.9, while τ²-Bench Telecom and IFBench remain nearly flat at 98.0 and 73.9, respectively. All benchmarks ran at temperature 1.0, top_p 0.95, and a 128,000-token context window.
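For readers who want to reproduce the evaluation conditions, the reported settings map directly onto standard sampling parameters. A minimal sketch as an OpenAI-compatible request payload follows; the model identifier and the prompt are illustrative placeholders, not part of Nvidia's release:

```python
# Hedged sketch: the benchmark sampling settings expressed as an
# OpenAI-compatible chat-completions payload. The model id and prompt
# are placeholders, not published values.
request = {
    "model": "kimi-k2.6-nvfp4",  # placeholder model id (assumption)
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "temperature": 1.0,          # per the reported benchmark settings
    "top_p": 0.95,               # per the reported benchmark settings
}
```

Note that the 128,000-token context window is a property of the served model rather than a per-request parameter in most OpenAI-compatible stacks.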
Nvidia's Model Optimizer is a GitHub-hosted quantization framework designed to reduce inference overhead without retraining. NVFP4 appears to be a 4-bit floating-point scheme optimized for Nvidia hardware, though the model cards do not specify GPU requirements or publish inference speed or memory-footprint comparisons against the baseline. The next question is whether NVFP4 delivers measurable latency or throughput gains on consumer RTX or datacenter H100 hardware—and whether Nvidia will publish those numbers alongside the weights.
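To make the format concrete: Nvidia has publicly described NVFP4 as a 4-bit floating-point (E2M1) format with small blocks of values sharing a scale factor. The sketch below quantizes one 16-element block to the E2M1 grid with a shared per-block scale. It is a simplified illustration, not Nvidia's implementation—the real pipeline encodes scales in FP8 and adds a per-tensor second-level scale, both omitted here:

```python
# Hedged sketch of E2M1 block quantization, the 4-bit FP scheme NVFP4 is
# described as using. Simplifications: scales kept in full precision,
# no per-tensor second-level scale, plain nearest-value rounding.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(block):
    """Map one block of floats to signed E2M1 grid values plus a shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # largest value lands on 6.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        codes.append(q if x >= 0 else -q)
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate float values from codes and the block scale."""
    return [c * scale for c in codes]

# One 16-value block of example weights (illustrative numbers).
weights = [0.12, -0.8, 0.33, 0.05, 1.9, -2.4, 0.0, 0.6,
           -0.1, 0.45, 0.02, -0.33, 0.7, 1.1, -0.05, 0.25]
codes, scale = quantize_block(weights)
recon = dequantize_block(codes, scale)
```

The per-block scale is what lets a 4-bit grid with a maximum magnitude of 6.0 cover weights of very different dynamic ranges; quantization error is bounded within each block rather than across the whole tensor.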