GQLA unlocks dual decoding paths in DeepSeek attention for H100 and H20
New preprint proposes Group-Query Latent Attention, a drop-in modification of DeepSeek's MLA that exposes two decoding paths in the same weights—one for H100, one for export-restricted H20—without retraining or custom kernels.

Group-Query Latent Attention (GQLA) is a hardware-adaptive attention mechanism that addresses a practical limitation of Multi-head Latent Attention (MLA), the attention design used in DeepSeek-V2 and V3. MLA's trained weights lock inference to a single decoding path optimized for H100-class GPUs, so export-restricted H20 cards and commodity hardware cannot exploit Multi-Token Prediction gains or tensor parallelism along the head axis. GQLA's weights instead expose two algebraically equivalent decoding paths over identical parameters: an MQA-absorb path for H100 and a GQA path with a per-group expanded cache for H20. The runtime selects whichever path matches the target hardware, with no retraining or custom kernels.
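
The claim that both paths are algebraically equivalent over the same parameters can be illustrated with the generic weight-absorption identity familiar from MLA. The sketch below is not the authors' implementation: it ignores RoPE, softmax, and the value projection, and the names `scores_gqa_path`, `scores_absorb_path`, and `w_uk` (a per-group key up-projection) are hypothetical, as are the tensor sizes. It only shows that expanding a shared latent cache into per-group keys gives the same attention scores as folding the up-projection into the queries and attending over the latent cache directly.

```python
import numpy as np


def scores_gqa_path(q, c, w_uk):
    """Expand the latent cache into per-group keys, then score (GQA-style).

    q:    (n_heads, d_h)        one query vector per head
    c:    (seq_len, d_c)        shared latent cache
    w_uk: (n_groups, d_c, d_h)  hypothetical per-group key up-projection
    """
    heads_per_group = q.shape[0] // w_uk.shape[0]
    # Per-group expanded key cache: (n_groups, seq_len, d_h)
    k = np.einsum("tc,gcd->gtd", c, w_uk)
    # Every head in a group reads that group's keys: (n_heads, seq_len, d_h)
    k_per_head = np.repeat(k, heads_per_group, axis=0)
    return np.einsum("hd,htd->ht", q, k_per_head)


def scores_absorb_path(q, c, w_uk):
    """Fold the up-projection into the queries, then score against the
    single shared latent cache (MQA-style)."""
    heads_per_group = q.shape[0] // w_uk.shape[0]
    w_per_head = np.repeat(w_uk, heads_per_group, axis=0)  # (n_heads, d_c, d_h)
    q_latent = np.einsum("hd,hcd->hc", q, w_per_head)      # (n_heads, d_c)
    return q_latent @ c.T                                  # (n_heads, seq_len)


# Illustrative sizes only; they are assumptions, not taken from the preprint.
rng = np.random.default_rng(0)
n_heads, n_groups, d_h, d_c, seq_len = 32, 8, 128, 512, 16
q = rng.standard_normal((n_heads, d_h))
c = rng.standard_normal((seq_len, d_c))
w_uk = rng.standard_normal((n_groups, d_c, d_h))

gqa_scores = scores_gqa_path(q, c, w_uk)
absorb_scores = scores_absorb_path(q, c, w_uk)
paths_match = bool(np.allclose(gqa_scores, absorb_scores))
print("paths agree:", paths_match)
```

The two functions differ only in where the matrix product with `w_uk` is applied, which is why no retraining is needed to switch between them: the absorb path reads one small latent cache per token, while the expanded path stores larger per-group keys that can be sharded by group.
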
On LLaMA-3-8B, the MQA-absorb path compresses the per-token KV cache to 28.125% of the GQA baseline, while the per-group path preserves GQA-level memory traffic and supports up to 8-way zero-redundancy tensor parallelism. A single GQLA weight set pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + Multi-Token Prediction, s_q=2). The authors also introduce TransGQLA, a conversion method that turns pretrained GQA checkpoints into GQLA models without pretraining from scratch. The preprint was posted to arXiv on May 18, 2026.
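
The 28.125% figure can be reproduced with back-of-the-envelope arithmetic. LLaMA-3-8B's published configuration has 8 KV heads of dimension 128. The latent sizes below (a 512-dimensional compressed latent plus a 64-dimensional RoPE key) are an assumption borrowed from DeepSeek-V2's MLA configuration; the summary above does not state the split GQLA uses, so treat this as one plausible decomposition rather than the preprint's own accounting.

```python
# Per-token, per-layer KV cache size, counted in elements (not bytes).

N_KV_HEADS = 8   # LLaMA-3-8B published config
HEAD_DIM = 128   # LLaMA-3-8B published config
LATENT_DIM = 512  # ASSUMPTION: DeepSeek-V2 MLA latent size (kv_lora_rank)
ROPE_DIM = 64     # ASSUMPTION: DeepSeek-V2 MLA decoupled RoPE key size


def gqa_cache_elements(n_kv_heads: int, head_dim: int) -> int:
    """Keys and values are both cached for every KV head."""
    return 2 * n_kv_heads * head_dim


def latent_cache_elements(latent_dim: int, rope_dim: int) -> int:
    """One shared latent vector plus one shared RoPE key per token."""
    return latent_dim + rope_dim


gqa_elems = gqa_cache_elements(N_KV_HEADS, HEAD_DIM)         # 2048
latent_elems = latent_cache_elements(LATENT_DIM, ROPE_DIM)   # 576
ratio = latent_elems / gqa_elems

print(f"GQA baseline: {gqa_elems} elements per token per layer")
print(f"Latent cache: {latent_elems} elements per token per layer")
print(f"Ratio: {ratio:.3%}")  # prints "Ratio: 28.125%"
```

Under the same configuration, the 8 KV groups also line up with the 8-way tensor-parallel limit quoted above: with one group per rank, no rank would need to duplicate another rank's keys or values.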