DRoRAE cuts ImageNet reconstruction error by 49% using multi-layer encoder fusion
A new autoencoder architecture fuses all encoder layers instead of just the last, recovering low-level visual detail and cutting reconstruction error in half on ImageNet-256.

DRoRAE (Depth-Routed Representation AutoEncoder) is a visual tokenizer that fuses features from every layer of a frozen pretrained vision encoder, rather than extracting only the final layer's output. The architecture addresses a fundamental limitation in existing representation autoencoders: they universally discard the hierarchical information distributed across intermediate encoder layers, relying solely on the last layer where low-level visual details survive only as attenuated residuals after multiple rounds of semantic abstraction.
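The contrast between last-layer extraction and multi-layer fusion can be sketched in a few lines. This is not the official DRoRAE code: the layer count, dimensions, and the softmax-weighted sum over depth are illustrative assumptions standing in for whatever fusion the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, tokens, dim = 12, 16, 64

# Hidden states from each layer of a (mock) frozen pretrained encoder.
hidden_states = [rng.standard_normal((tokens, dim)) for _ in range(num_layers)]

# Conventional tokenizer: keep only the final layer's output.
last_layer_latent = hidden_states[-1]

# Multi-layer fusion: one learned logit per layer, softmax-normalized,
# then a weighted sum across the encoder's depth.
layer_logits = rng.standard_normal(num_layers)
weights = np.exp(layer_logits) / np.exp(layer_logits).sum()
fused_latent = sum(w * h for w, h in zip(weights, hidden_states))

print(fused_latent.shape)  # same shape as the single-layer latent
```

Because the fused latent keeps the shape of the last-layer latent, it can be fed to the same frozen decoder, which is the compatibility property the architecture relies on.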
The method uses energy-constrained routing to adaptively aggregate all encoder layers, producing an enriched latent representation that remains compatible with a frozen pretrained decoder. A lightweight fusion module performs incremental correction across the depth of the encoder, recovering visual information that would otherwise be lost. Training decouples fusion learning from decoder optimization through staged phases: the fusion module is first learned under the implicit distributional constraint of the frozen decoder, and the decoder is then fine-tuned to fully exploit the enriched representation.
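One plausible reading of "energy-constrained routing" is that the per-layer routing weights are kept from collapsing onto a single layer by capping their squared-l2 "energy." The sketch below is a hedged interpretation, not the paper's formulation: the budget value, the temperature search, and the softmax parameterization are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def energy_constrained_routing(layer_scores, energy_budget=0.2, max_iters=50):
    """Raise the softmax temperature (flattening the distribution) until
    sum(w**2) <= energy_budget; a uniform routing over L layers has energy 1/L."""
    temperature = 1.0
    weights = softmax(layer_scores / temperature)
    for _ in range(max_iters):
        if np.sum(weights ** 2) <= energy_budget:
            break
        temperature *= 1.5
        weights = softmax(layer_scores / temperature)
    return weights

scores = np.array([0.1, 0.2, 3.0, 0.4, 0.1, 2.5])  # toy per-layer scores
w = energy_constrained_routing(scores)
print(np.sum(w ** 2))  # at or below the energy budget
```

The effect is that every layer retains a nonzero share of the routing mass, which matches the article's claim that the fusion adaptively aggregates all encoder layers rather than selecting one.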
On ImageNet-256, DRoRAE reduces reconstruction FID (rFID) from 0.57 to 0.29—a 49 percent improvement over single-layer extraction. Generation FID drops from 1.74 to 1.65 when paired with AutoGuidance, and the gains transfer to text-to-image synthesis tasks. The work also uncovers a log-linear scaling law with R² = 0.86 between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in language models.

For practitioners building open-source image generation systems, the finding matters because most visual tokenizers in the wild—including those used in latent diffusion pipelines—extract features from a single encoder layer, leaving performance on the table. The scaling law suggests that increasing fusion capacity yields predictable quality gains, offering a concrete lever for model builders who have already exhausted gains from larger vocabularies or deeper decoders.
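A log-linear scaling law of this kind is straightforward to fit and check. The capacity/rFID pairs below are hypothetical numbers invented for illustration, not the paper's measurements; only the endpoint range (roughly 0.57 down to 0.29) follows the reported results.

```python
import numpy as np

capacity = np.array([1, 2, 4, 8, 16, 32])  # hypothetical fusion capacities
rfid = np.array([0.57, 0.51, 0.45, 0.40, 0.34, 0.29])  # hypothetical rFIDs

# Log-linear law: rFID is modeled as a linear function of log(capacity).
x = np.log(capacity)
slope, intercept = np.polyfit(x, rfid, 1)
pred = slope * x + intercept

# Coefficient of determination (R²) for the fit.
ss_res = np.sum((rfid - pred) ** 2)
ss_tot = np.sum((rfid - np.mean(rfid)) ** 2)
r2 = 1 - ss_res / ss_tot
print(slope, r2)  # negative slope: more fusion capacity, lower rFID
```

An R² computed this way is the same statistic the article quotes (0.86 on the real data), so a practitioner can run the identical check on their own tokenizer ablations to see whether capacity is a predictable lever.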