Gemma 4 MTP trails DFlash on MoE, leads on dense in single-H100 speculative-decoding test
Single-H100 benchmark shows Google's Multi-Token Prediction holds a 3% lead over z-lab's DFlash on the 31B dense model but trails by 16% on the 26B MoE, with speculative-decoding gains shrinking as the target's active parameter count drops.

Speculative decoding's effectiveness hinges on the target model's baseline cost: on the dense model, Google's Multi-Token Prediction (MTP) outpaced z-lab's DFlash by a narrow margin, but the advantage flipped when the target was swapped for a mixture-of-experts variant.
The benchmark ran on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset, comparing both approaches across 880 prompts spanning 11 categories. It evaluated google/gemma-4-31B-it (dense) and google/gemma-4-26B-A4B-it (MoE), with MTP drafting 8 tokens ahead and DFlash drafting 15. Prefix caching was disabled, temperature was set to zero, and context length was capped at 32,768 tokens.
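For readers who want to approximate this setup, the sketch below uses vLLM's offline LLM API. The model name and the settings (8 drafted tokens, prefix caching off, greedy sampling, 32,768-token context) come from the benchmark description; the speculative_config method string and exact argument names are assumptions that may vary across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Minimal sketch of a speculative-decoding run resembling the benchmark's
# MTP configuration. The "method" value is an assumption; check the vLLM
# release you are running for the exact supported options.
llm = LLM(
    model="google/gemma-4-31B-it",       # dense target from the benchmark
    max_model_len=32768,                 # context length capped at 32,768 tokens
    enable_prefix_caching=False,         # prefix caching disabled
    speculative_config={
        "method": "mtp",                 # assumed name for MTP-style drafting
        "num_speculative_tokens": 8,     # MTP drafted 8 tokens ahead
    },
)

params = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy decoding
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```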
On the 31B dense model at concurrency 1, MTP hit 125.3 output tokens per second — a 3.11× speedup over the 40.3 tok/s baseline — while DFlash reached 122.1 tok/s for a 3.03× gain. At concurrency 16, MTP scaled to 953 tok/s and DFlash to 725 tok/s, both well ahead of the baseline's 375 tok/s. The MoE variant reversed the ranking: baseline throughput started at 177.1 tok/s thanks to the model's 3.8B active parameters (out of 25.2B total), and DFlash jumped to 306.4 tok/s (1.73×) while MTP reached 264.2 tok/s (1.49×). At concurrency 16, DFlash led again with 1,957 tok/s versus MTP's 1,808 tok/s.
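The speedup multipliers follow directly from the throughput figures quoted above; a quick check of the concurrency-1 numbers:

```python
# Speedup = speculative throughput / baseline throughput, using the
# concurrency-1 output-tokens-per-second figures reported in the article.
results = {
    "dense 31B":   {"baseline": 40.3,  "MTP": 125.3, "DFlash": 122.1},
    "MoE 26B-A4B": {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
}

for model, tok_s in results.items():
    for method in ("MTP", "DFlash"):
        speedup = tok_s[method] / tok_s["baseline"]
        print(f"{model:12s} {method:7s} {speedup:.2f}x")
# dense 31B: MTP 3.11x, DFlash 3.03x; MoE 26B-A4B: MTP 1.49x, DFlash 1.73x
```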
The smaller MoE speedups reflected the target model's lower inference cost — speculative decoding removes less compute when the base model is already cheap to run. Task type also mattered: coding, math, STEM, and reasoning workloads saw larger gains because token patterns are more predictable, while writing, summarization, and roleplay improved less. Per-position acceptance rates did not directly predict overall throughput; the interaction between draft quality, target-model cost, and workload structure determined the final speedup.
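One common way to reason about that interaction, drawn from the standard speculative-decoding analysis rather than from the benchmark itself, is to combine per-token acceptance rate, draft length, and the cost of drafting relative to a target-model forward pass. The sketch below uses purely illustrative numbers to show why the same acceptance rate yields a smaller speedup when the target is cheap to run.

```python
# Back-of-envelope model for speculative-decoding speedup.
# With per-token acceptance rate a and draft length k, the expected number
# of tokens emitted per target-model verification step is
#   E = (1 - a**(k + 1)) / (1 - a)
# Speedup then divides E by the relative cost of one drafted-and-verified
# step, so acceptance rate alone does not determine throughput.

def expected_tokens(a: float, k: int) -> float:
    """Expected tokens emitted per verification pass."""
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(a: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is the per-token drafting cost
    as a fraction of one target-model forward pass (assumed value)."""
    return expected_tokens(a, k) / (1 + k * draft_cost)

# Illustrative values, not benchmark measurements: with an identical
# acceptance rate, a cheap target makes drafting overhead relatively larger,
# so the net speedup shrinks.
print(speedup(a=0.8, k=8, draft_cost=0.05))  # expensive dense target
print(speedup(a=0.8, k=8, draft_cost=0.25))  # cheap MoE target
```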