Alibaba DAR doubles early training speed in diffusion transformers via adaptive routing
Alibaba researchers propose Diffusion-Adaptive Routing (DAR), a timestep-dependent layer-merging technique that replaces residual connections in diffusion transformers, preserving high-frequency detail during distillation and accelerating early training by 2× when paired with REPA.
Alibaba researchers have published a preprint introducing Diffusion-Adaptive Routing (DAR), a replacement for residual connections in diffusion transformers. Instead of relying on a fixed skip path, the technique routes layer outputs according to the current denoising timestep, with the aim of preserving high-frequency detail when distilling large text-to-image models.
According to the arXiv preprint (2605.20708), DAR doubles training speed during the early phases of training when combined with REPA, a representation-alignment training method. The approach dynamically adjusts how layer outputs are combined as the denoising process progresses, rather than using the same fixed residual paths at every timestep.
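To make the contrast with a fixed residual path concrete, here is a minimal NumPy sketch of timestep-dependent layer merging. It is an illustration only, not the preprint's formulation, which has not been released: the sinusoidal embedding, the softmax gate, and every name (`timestep_embedding`, `routing_weights`, `merge_layers`, `gate_matrix`) are assumptions made for this example.

```python
import numpy as np

def timestep_embedding(t, dim=8):
    """Sinusoidal embedding of a scalar timestep t in [0, 1] (assumed form)."""
    freqs = 2.0 ** np.arange(dim // 2)
    angles = 2.0 * np.pi * t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def routing_weights(t, gate_matrix):
    """Softmax weights over layer outputs, conditioned on the timestep."""
    logits = gate_matrix @ timestep_embedding(t, gate_matrix.shape[1])
    logits = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def merge_layers(layer_outputs, t, gate_matrix):
    """Weighted sum of layer outputs, standing in for a fixed x + f(x)."""
    weights = routing_weights(t, gate_matrix)
    stacked = np.stack(layer_outputs)  # (num_layers, tokens, channels)
    return np.tensordot(weights, stacked, axes=1)  # (tokens, channels)

# Hypothetical sizes and random values, purely for demonstration.
rng = np.random.default_rng(0)
num_layers, tokens, channels, emb_dim = 3, 4, 16, 8
gate_matrix = rng.normal(size=(num_layers, emb_dim))
layer_outputs = [rng.normal(size=(tokens, channels)) for _ in range(num_layers)]

early = routing_weights(0.9, gate_matrix)  # high-noise timestep
late = routing_weights(0.2, gate_matrix)   # low-noise timestep
merged = merge_layers(layer_outputs, 0.9, gate_matrix)
```

In this toy version the same set of layer outputs is mixed with different weights at `t = 0.9` and `t = 0.2`, whereas a standard residual connection would add them with the same fixed weight at every timestep.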
What stands out
- 01 Timestep-dependent routing. DAR adapts its layer-merging strategy to the current diffusion timestep, allowing the network to handle different noise levels with different computational paths.
- 02 High-frequency preservation. The method is designed to retain fine detail during model distillation, a process that often blurs sharp edges and textures in image generation.
- 03 2× early training speedup. When paired with REPA, DAR cuts early-stage training time in half compared to standard residual connections, though the preprint does not specify hardware or dataset details.
- 04 Targets large text-to-image models. The technique is framed for distilling and accelerating models at the scale of FLUX, Stable Diffusion 3, and similar transformer-based diffusion architectures.
- 05 Preprint stage only. No code, weights, or implementation details have been released. The arXiv submission remains the sole public artifact.
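For readers unfamiliar with what "high-frequency detail" means in practice, the sketch below shows one rough way to quantify it: the share of an image's spectral energy above a radial frequency cutoff. This is a generic illustration, not a metric taken from the preprint. The function names, the 0.25 cycles-per-pixel cutoff, and the box blur standing in for distillation blur are all assumptions.

```python
import numpy as np

def high_freq_ratio(image, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` (radial, cycles per pixel)."""
    spectrum = np.abs(np.fft.fft2(image)) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return spectrum[radius > cutoff].sum() / spectrum.sum()

def box_blur(image):
    """3x3 mean filter with wrap-around edges (a stand-in for blurring)."""
    out = np.zeros_like(image)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(image, (dy, dx), axis=(0, 1))
    return out / 9.0

# Synthetic grayscale "image": white noise has energy at all frequencies.
rng = np.random.default_rng(0)
sharp = rng.normal(size=(64, 64))
blurred = box_blur(sharp)

sharp_ratio = high_freq_ratio(sharp)
blurred_ratio = high_freq_ratio(blurred)
```

The blurred image ends up with a smaller high-frequency share than the sharp one, which is the kind of loss of edges and texture the second point above describes.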


