L2P transfers latent diffusion knowledge to pixel-space generation on 8 GPUs
New framework converts pre-trained latent diffusion models to pixel-space generators by freezing intermediate layers and training only shallow layers on synthetic images, enabling native 4K output without VAE overhead.

L2P, a transfer learning framework from researchers across multiple institutions, repurposes pre-trained latent diffusion models for direct pixel-space generation. The method, detailed in a preprint published May 13, freezes most of a source LDM's architecture and trains only the shallow layers to map latent representations to pixels, eliminating the variational autoencoder entirely.
The approach uses large-patch tokenization and trains exclusively on synthetic images generated by the source LDM itself. Because it fits the smoother manifold of LDM-generated images rather than raw training data, L2P converges rapidly on a single 8-GPU setup. The authors report that L2P-converted models match their source LDMs on DPG-Bench and reach 93 percent of source performance on GenEval, while removing the VAE memory bottleneck unlocks native 4K ultra-high-resolution generation that latent models cannot reach without upscaling.
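In rough terms, the recipe is to freeze the interior of the source denoiser and train only new shallow input/output layers that patchify pixels directly, supervised with images the source LDM generated. The following is a minimal PyTorch sketch of that setup under stated assumptions: module names, sizes, the toy corruption, and the stand-in backbone are all hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

PATCH = 16   # large-patch tokenization (illustrative size)
DIM = 512    # hidden width of the toy frozen core
IMG = 256    # training resolution for this sketch

class PixelAdapter(nn.Module):
    """Trainable shallow layers: map pixels to tokens and tokens back to pixels."""
    def __init__(self):
        super().__init__()
        self.patch_in = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        self.patch_out = nn.ConvTranspose2d(DIM, 3, kernel_size=PATCH, stride=PATCH)

class ConvertedModel(nn.Module):
    """Frozen pre-trained interior wrapped by trainable pixel-space adapters."""
    def __init__(self, frozen_core: nn.Module):
        super().__init__()
        self.adapter = PixelAdapter()
        self.core = frozen_core
        for p in self.core.parameters():   # freeze the intermediate layers
            p.requires_grad_(False)

    def forward(self, noisy_pixels):       # timestep/text conditioning omitted for brevity
        tokens = self.adapter.patch_in(noisy_pixels)        # (B, DIM, H/P, W/P)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)          # (B, N, DIM)
        tokens = self.core(tokens)                          # frozen interior blocks
        tokens = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.adapter.patch_out(tokens)               # predicted noise in pixel space

# Toy stand-in for the frozen interior of the source LDM's denoiser.
frozen_core = nn.Sequential(
    *[nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True) for _ in range(4)]
)
model = ConvertedModel(frozen_core)
opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# Training corpus: images synthesized by the source LDM (random placeholder here).
synthetic = torch.rand(2, 3, IMG, IMG)
noise = torch.randn_like(synthetic)
noisy = synthetic + 0.5 * noise                        # toy corruption, not a real noise schedule
loss = nn.functional.mse_loss(model(noisy), noise)     # simple denoising objective
loss.backward()
opt.step()
```

Only the adapter parameters receive optimizer updates; the frozen interior carries over the source LDM's generative knowledge while the shallow layers learn to read and write pixels in its place.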
What stands out
- Minimal training overhead: L2P runs on 8 GPUs with zero real-data collection, using only LDM-generated synthetic images as the training corpus.
- Performance parity despite architecture change: Converted models perform on par with source LDMs on DPG-Bench and hit 93% of source performance on GenEval, despite discarding the VAE.
- Native 4K generation: Eliminating the VAE memory bottleneck enables native 4K output, a ceiling that latent diffusion models cannot reach without post-hoc upscaling.
- Generalizable across LDM families: The paper tests L2P across mainstream latent diffusion architectures, showing the transfer paradigm is not tied to a single model.