Orthrus achieves 7.8× speedup by pairing frozen LLM with parallel diffusion module
A new arXiv preprint introduces Orthrus, a framework that pairs a frozen autoregressive LLM with a lightweight diffusion module to enable parallel token generation while preserving exact output fidelity and adding only minimal memory overhead.
Orthrus is a dual-architecture inference framework that combines autoregressive and diffusion-based token generation in a single system, according to an arXiv preprint published May 14. The framework augments a frozen large language model with a trainable diffusion module that shares the same key-value (KV) cache, enabling parallel generation without sacrificing the exact output fidelity of standard autoregressive decoding. The paper reports speedups of up to 7.8× over baseline autoregressive inference, with O(1) cache memory overhead and only a small number of added parameters.
Standard autoregressive LLMs generate tokens one at a time, creating a throughput bottleneck for high-speed inference workloads. Diffusion language models attempt to parallelize generation but typically suffer from accuracy degradation, high training costs, and weak convergence guarantees. Orthrus addresses both limitations by running two "views" in tandem: the autoregressive head pre-fills context to build accurate KV representations, while the diffusion head executes parallel token generation from those same representations. An exact consensus mechanism between the two views ensures that the final output matches what the frozen LLM would produce in standard autoregressive mode, guaranteeing lossless inference.
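The abstract does not spell out how the consensus check is implemented. One plausible reading, sketched below in Python, is a verify-and-accept loop in the spirit of speculative decoding: the diffusion view proposes a block of tokens in parallel, the frozen autoregressive model scores all of those positions in a single forward pass, and only the prefix that matches its own greedy choices is kept. All names in the sketch (orthrus_decode, diffusion_draft, ar_verify) are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch only; function names and the verification strategy are
# assumptions, not taken from the Orthrus paper.
from typing import Callable, List

def orthrus_decode(
    prompt: List[int],
    diffusion_draft: Callable[[List[int], int], List[int]],  # diffusion view: k draft tokens in parallel
    ar_verify: Callable[[List[int], List[int]], List[int]],  # frozen AR view: greedy token at each draft slot
    max_new_tokens: int = 64,
    block_size: int = 8,
) -> List[int]:
    """Draft a block with the diffusion view, then keep only the prefix the
    frozen AR model would have produced itself, so the output stays AR-exact."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = diffusion_draft(out, block_size)
        # A single teacher-forced AR pass can score every draft position at
        # once, so verification avoids the token-by-token bottleneck.
        ar_tokens = ar_verify(out, draft)
        n = 0
        while n < len(draft) and draft[n] == ar_tokens[n]:
            n += 1
        out.extend(draft[:n])                    # accept the agreed prefix
        if n < len(draft):
            out.append(ar_tokens[n])             # AR's own token at the first mismatch
    return out[: len(prompt) + max_new_tokens]

# Toy stand-ins so the sketch runs end to end: the "AR model" always wants
# token 1; the diffusion head drafts mostly-agreeing blocks.
ar_verify = lambda ctx, draft: [1] * len(draft)
diffusion_draft = lambda ctx, k: [1] * (k - 1) + [2]
print(orthrus_decode([0], diffusion_draft, ar_verify, max_new_tokens=8, block_size=4))
```

Under that reading, the speedup depends on how often the diffusion draft agrees with the frozen model: each accepted block amortizes one parallel verification pass over several output tokens, while a mismatch costs no more than a single ordinary autoregressive step.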
Architecture and integration
Both the autoregressive and diffusion heads attend to the same key-value cache. The autoregressive component handles context encoding and establishes the KV state; the diffusion component then samples multiple tokens in parallel from that state. Because the diffusion module is lightweight and the base LLM remains frozen, the framework can be integrated into existing Transformer architectures without retraining the underlying model. The consensus mechanism, which enforces agreement between the two views, is what preserves exact generation fidelity, a property that diffusion-only approaches typically cannot guarantee.
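The preprint does not publish the module's design, but the description above maps naturally onto a thin trainable head attached to a frozen Transformer. The PyTorch sketch below illustrates that shape under stated assumptions; the class name, the choice of a single cross-attention block, and the argument names are all hypothetical, not the authors' code.

```python
# Hypothetical integration sketch; not the authors' code. The diffusion head
# here is a single cross-attention block, standing in for whatever denoising
# network Orthrus actually uses.
import torch
import torch.nn as nn

class OrthrusWrapper(nn.Module):
    def __init__(self, base_lm: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():      # the base LLM stays frozen
            p.requires_grad_(False)
        # Lightweight trainable parts: one attention block over the frozen
        # model's cached representations plus an output projection.
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
        self.to_logits = nn.Linear(hidden_size, vocab_size)

    def draft_block(self, draft_queries: torch.Tensor, cached_states: torch.Tensor) -> torch.Tensor:
        """One step of the diffusion view: score a whole block of draft
        positions in parallel by attending to the shared cached states."""
        h, _ = self.cross_attn(draft_queries, cached_states, cached_states)
        return self.to_logits(h)                 # logits for every draft position at once
```

In such a setup, cached_states would come from the frozen model's prefill pass over the prompt, and training would update only cross_attn and to_logits, which is consistent with the paper's claims of minimal parameter additions and no retraining of the base model.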
The preprint does not specify training data, hardware requirements, or release plans for weights. The authors describe Orthrus as "designed to seamlessly integrate into existing Transformers," suggesting broad applicability to current open-weight and proprietary models, though no specific model names or benchmark comparisons against named baselines appear in the abstract. The 7.8× speedup figure represents the upper end of the reported range; the paper does not detail the lower bound or the conditions under which different speedups were measured.
