Mix-Quant achieves up to 3× prefilling speedup in agentic LLMs with phase-aware FP4 quantization
New framework from NUS and collaborators applies FP4 quantization to the prefilling stage while keeping BF16 precision for decoding, preserving task accuracy in long-context agent workflows.

Mix-Quant, a phase-aware quantization framework from researchers at the National University of Singapore and collaborators, accelerates the prefilling stage of agentic LLM inference by up to 3×. The paper addresses a bottleneck in multi-turn agent workflows: the compute-intensive prefilling phase that processes long input contexts from planning, tool calls, and memory retrieval before the model begins generating tokens. The team found that quantizing the entire inference pipeline to FP4 degrades task performance, but the prefilling stage alone tolerates aggressive quantization with minimal accuracy loss. Mix-Quant exploits this asymmetry by running prefilling in NVFP4, NVIDIA's hardware-accelerated 4-bit floating-point format, while keeping decoding in BF16 precision.
The approach decouples throughput optimization from output quality. During prefilling, the model processes the input context at high speed using 4-bit weights and activations; once token generation begins, the system switches to BF16 for the autoregressive decoding loop. Experiments across long-context and agentic benchmarks show the method largely preserves downstream task accuracy while cutting prefilling latency by up to a factor of three.
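
To make the phase split concrete, here is a minimal pure-Python sketch of the idea, not the authors' implementation. It simulates 4-bit quantization by rounding values to the eight magnitudes of an E2M1 float with one shared scale per block, and applies that rounding to weights and activations only when a layer is called in the prefill phase. The names `fake_quant_fp4` and `linear`, the `phase` argument, and the block size of 16 are assumptions made for illustration; real NVFP4 kernels run on hardware and store the block scales in a low-precision format rather than in a Python float.

```python
from typing import List

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of elements sharing one scale


def fake_quant_fp4(values: List[float], block: int = BLOCK) -> List[float]:
    """Simulate FP4: scale each block so its largest magnitude maps to 6.0,
    round every element to the nearest grid point, then scale back."""
    out: List[float] = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out


def linear(x: List[float], weight: List[List[float]], phase: str) -> List[float]:
    """Compute y = W x. Operands are quantized only in the prefill phase;
    the decode phase uses them unchanged (standing in for BF16)."""
    if phase == "prefill":
        x = fake_quant_fp4(x)
        weight = [fake_quant_fp4(row) for row in weight]
    return [sum(w * a for w, a in zip(row, x)) for row in weight]
```

Calling `linear(x, weight, "prefill")` on the prompt and `linear(x, weight, "decode")` for each generated token mirrors the split described above: the rounding error is confined to context processing, while every generated token is computed from unquantized operands.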