IDEAL discrete autoencoder achieves 0.61 rFID by aligning shallow and deep vision features
A new discrete representation autoencoder jointly aligns quantized visual tokens with both shallow and deep vision foundation model features, reaching state-of-the-art reconstruction and autoregressive image generation scores.

IDEAL is a discrete representation autoencoder that aligns quantized visual tokens with both shallow and deep features from pretrained vision foundation models (VFMs). The approach addresses a persistent limitation in existing representation autoencoders: deep VFM features carry rich semantics but lose fine-grained visual detail, especially after discretization. By jointly aligning tokens with deep features and with shallow features, which preserve local appearance and structural information, IDEAL produces discrete tokens that maintain both visual fidelity and high-level meaning.
Representation autoencoders have become a popular architecture for building semantically rich latent spaces on top of pretrained vision models. The typical pattern is to extract deep features from a foundation model and use them as supervision for a learned tokenizer. That works well for capturing high-level concepts (object categories, scene layout, compositional structure), but the resulting tokens often fail to reconstruct fine textures, edges, and local color variation. Once those tokens are discretized into a finite codebook, the missing low-level information becomes nearly impossible to recover during decoding. IDEAL's insight is that shallow VFM layers retain the local detail that deep layers discard, and that aligning tokens with both depth levels simultaneously lets the autoencoder preserve the full spectrum of visual information.
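To make the dual-alignment idea concrete, here is a minimal sketch of a joint alignment objective. The article does not give IDEAL's actual loss, so everything here is an assumption for illustration: the cosine-distance form, the nearest-neighbor quantizer, the function names (`quantize`, `dual_alignment_loss`), and the weights `w_shallow` and `w_deep`. It also assumes tokens and VFM features already share one dimension, and it leaves out the projection heads and gradient handling a real training setup would need.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); both vectors are assumed to be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def quantize(latent, codebook):
    """Snap a continuous latent to its nearest codebook entry (squared L2)."""
    return min(
        codebook,
        key=lambda code: sum((a - b) ** 2 for a, b in zip(latent, code)),
    )


def dual_alignment_loss(latents, codebook, shallow_feats, deep_feats,
                        w_shallow=1.0, w_deep=1.0):
    """Hypothetical joint loss: align quantized tokens with both depth levels."""
    tokens = [quantize(z, codebook) for z in latents]
    # Mean cosine distance between each token and its shallow-layer target.
    shallow_term = sum(
        cosine_distance(t, s) for t, s in zip(tokens, shallow_feats)
    ) / len(tokens)
    # Mean cosine distance between each token and its deep-layer target.
    deep_term = sum(
        cosine_distance(t, d) for t, d in zip(tokens, deep_feats)
    ) / len(tokens)
    return w_shallow * shallow_term + w_deep * deep_term


# Toy example: two patches, a two-entry codebook, 2-D features.
codebook = [[1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.2, 0.8]]        # quantize to [1, 0] and [0, 1]
shallow_feats = [[2.0, 0.0], [0.0, 3.0]]  # same direction as the tokens
deep_feats = [[0.0, 1.0], [1.0, 0.0]]     # orthogonal to the tokens

print(dual_alignment_loss(latents, codebook, shallow_feats, deep_feats))  # 1.0
```

In this toy case the shallow term is zero and the deep term is one, so the total is driven entirely by the deep mismatch. Dropping either term would leave the tokens unconstrained at that depth level, which is the gap the dual alignment is meant to close.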
On ImageNet reconstruction, IDEAL achieves 0.61 rFID (lower is better), improving on the previous best method by 0.28. When applied to autoregressive image generation, the framework delivers a gFID of 1.89, a new state of the art among autoregressive models. Ablations show that shallow features contribute measurably to reconstruction quality, and the improvement holds across different VFM backbones, suggesting the strategy is architecture-agnostic. The same principle, extracting complementary information from multiple network depths, could extend to video tokenization, where temporal consistency and fine motion detail are equally hard to preserve in a discrete latent space.
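The rFID and gFID figures above are Fréchet Inception Distances, measured on reconstructed and generated images respectively, with lower values meaning the image statistics sit closer to those of real images. The real metric compares multivariate Gaussians fitted to Inception network features. The sketch below is a deliberately simplified one-dimensional version, where the Fréchet distance reduces to the squared difference of means plus the squared difference of standard deviations. The function name `frechet_distance_1d` and the toy inputs are illustrative assumptions, not part of any IDEAL evaluation code.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Squared Frechet distance between 1-D Gaussians fitted to two samples.

    Simplification: the actual FID uses high-dimensional Inception features
    and full covariance matrices; here each sample is a list of scalars and
    the population standard deviation stands in for the covariance.
    """
    mu_r = statistics.fmean(real)
    mu_g = statistics.fmean(generated)
    sd_r = statistics.pstdev(real)
    sd_g = statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Identical samples have identical statistics, so the distance is zero.
print(frechet_distance_1d([1.0, 3.0], [1.0, 3.0]))  # 0.0

# Shifting and widening the second sample increases the distance.
print(frechet_distance_1d([1.0, 3.0], [4.0, 8.0]))
```

Because the score depends only on feature statistics, a gap such as the 0.28 rFID improvement reflects a shift in distribution-level similarity rather than a per-image error.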






