IDEAL discrete autoencoder achieves 0.61 rFID by aligning shallow and deep vision features

A new discrete representation autoencoder jointly aligns quantized visual tokens with both shallow and deep vision foundation model features, reaching state-of-the-art reconstruction and autoregressive image generation scores.

By Alex Sokoloff · June 12, 2026

IDEAL is a discrete representation autoencoder that aligns quantized visual tokens with both shallow and deep features from pretrained vision foundation models (VFMs). The approach addresses a persistent limitation in existing representation autoencoders: deep VFM features carry rich semantics but lose fine-grained visual detail, especially after discretization. By jointly aligning tokens with deep features and with shallow features, which preserve local appearance and structural information, IDEAL produces discrete tokens that maintain both visual fidelity and high-level meaning.
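To make the discretization step concrete, here is a minimal sketch of nearest-neighbor vector quantization, one common way to map continuous features onto a finite codebook. It is an illustration only: the codebook values and the function names `squared_distance` and `quantize` are hypothetical, and nothing here is taken from IDEAL's actual tokenizer.

```python
# Illustrative sketch only: nearest-neighbor vector quantization with a toy codebook.
# The codebook values and function names are hypothetical, not taken from IDEAL.


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# Toy 2-D codebook with four entries (assumed values for illustration).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two slightly different continuous features map to the same discrete token,
# so whatever distinguished them is no longer available to the decoder.
index_a, code_a = quantize([0.9, 0.1], CODEBOOK)
index_b, code_b = quantize([0.8, 0.3], CODEBOOK)
```

In this toy case both inputs collapse onto the same codebook entry, which is the kind of detail loss that makes the choice of alignment targets matter before quantization.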

Representation autoencoders have become a popular architecture for building semantically rich latent spaces on top of pretrained vision models. The typical pattern is to extract deep features from a foundation model and use them as supervision for a learned tokenizer. That works well for capturing high-level concepts—object categories, scene layout, compositional structure—but the resulting tokens often fail to reconstruct fine textures, edges, and local color variation. Once those tokens are discretized into a finite codebook, the missing low-level information becomes nearly impossible to recover during decoding. IDEAL's insight is that shallow VFM layers retain exactly the local detail that deep layers discard, and that aligning tokens with both depth levels simultaneously lets the autoencoder preserve the full spectrum of visual information.
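The idea of aligning tokens with two depth levels at once can be sketched as a weighted sum of two alignment terms. The article does not give IDEAL's actual objective, so everything below is an assumption: the cosine-distance form, the weights `w_shallow` and `w_deep`, and the premise that hypothetical projection heads have already mapped the tokens into each target feature space.

```python
import math

# Hypothetical sketch of a joint shallow + deep alignment objective.
# The cosine-distance form and the weights are assumptions; the source
# does not specify IDEAL's actual loss.


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(predictions, targets):
    """Mean cosine distance (1 - similarity) over paired vectors."""
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predictions, targets)]
    return sum(distances) / len(distances)


def dual_alignment_loss(shallow_pred, shallow_target, deep_pred, deep_target,
                        w_shallow=1.0, w_deep=1.0):
    """Weighted sum of the shallow and deep alignment terms."""
    return (w_shallow * alignment_loss(shallow_pred, shallow_target)
            + w_deep * alignment_loss(deep_pred, deep_target))


# Toy example: token projections match the shallow targets in direction,
# but only one of the two matches its deep target.
token_proj = [[1.0, 0.0], [0.0, 1.0]]
shallow_feats = [[1.0, 0.0], [0.0, 2.0]]
deep_feats = [[0.0, 1.0], [0.0, 1.0]]

loss = dual_alignment_loss(token_proj, shallow_feats, token_proj, deep_feats)
```

Here the shallow term is zero and the deep term is 0.5, so the combined loss is 0.5. In a real system the two terms would be computed on learned projections and balanced against the reconstruction objective.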

On ImageNet reconstruction, IDEAL achieves 0.61 rFID (lower is better), improving on the previous best method by 0.28. When applied to autoregressive image generation, the framework delivers a gFID of 1.89, setting a new benchmark for autoregressive models. Extensive ablations of the dual-alignment mechanism show that shallow features contribute measurably to reconstruction quality, and the improvement holds across different VFM backbones, suggesting the strategy is architecture-agnostic. The same principle, extracting complementary information from multiple network depths, could extend to video tokenization, where temporal consistency and fine motion detail are equally hard to preserve in a discrete latent space.
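For readers unfamiliar with the metrics, rFID and gFID are both Fréchet Inception Distance scores: rFID compares reconstructions with the original images, and gFID compares generated samples with real ones. The full metric fits multivariate Gaussians to Inception network features. The sketch below shows only the one-dimensional special case, where the Fréchet distance between two Gaussians reduces to (mean difference)² + (standard deviation difference)². The function name and the sample values are illustrative assumptions, not the evaluation code behind the reported numbers.

```python
import statistics

# Simplified illustration: Frechet distance between two 1-D Gaussians fitted
# to samples. Real FID uses multivariate Inception features and covariance
# matrices; this univariate version is an assumption made for clarity.


def frechet_distance_1d(samples_a, samples_b):
    """(mu_a - mu_b)^2 + (sigma_a - sigma_b)^2 for Gaussian fits of the samples."""
    mu_a = statistics.fmean(samples_a)
    mu_b = statistics.fmean(samples_b)
    sigma_a = statistics.pstdev(samples_a)
    sigma_b = statistics.pstdev(samples_b)
    return (mu_a - mu_b) ** 2 + (sigma_a - sigma_b) ** 2


# Identical distributions score zero; shifting the mean by 1 adds 1.
same = frechet_distance_1d([0.0, 2.0], [0.0, 2.0])
shifted = frechet_distance_1d([0.0, 2.0], [1.0, 3.0])
```

A score of zero means the two feature distributions have matching Gaussian fits, which is why lower values indicate better reconstruction or generation quality.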
