StitchVM transfers reward models to diffusion latent space in 10 GPU-hours
A new model-stitching framework attaches frozen diffusion backbones to pretrained CLIP reward models, yielding a value function for noisy latents that speeds up DPS by 3.2× and halves peak GPU memory.
StitchVM, a model-stitching framework from researchers across multiple institutions, transfers pixel-space reward models to the noisy latent regime of diffusion models, enabling faster and more memory-efficient alignment. The method attaches a frozen diffusion backbone, such as Stable Diffusion 3.5 Medium, to a truncated pixel-space reward model like CLIP ViT-L, creating a hybrid that can score noisy intermediate latents directly. Stitching and finetuning the combined model takes only 10 GPU-hours, a fraction of the cost of Monte Carlo rollout methods, which repeatedly denoise samples to estimate value.
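One way to write down the hybrid described above is as a composition of three maps. This is a sketch in assumed notation, not the authors' formulation: f stands for the frozen diffusion backbone read out at some intermediate layer, g for the retained layers of the truncated reward model, s_φ for a hypothetical learned stitching map between the two feature spaces, and r for the original pixel-space reward on the clean sample x_0. The source states only that the backbone is frozen; which other parameters are updated during finetuning is not specified.

```latex
V_{\phi}(x_t, t) \;=\; g\bigl(s_{\phi}\bigl(f(x_t, t)\bigr)\bigr),
\qquad
V_{\phi}(x_t, t) \;\approx\; \mathbb{E}\bigl[\, r(x_0) \mid x_t \,\bigr]
```

The second relation expresses the goal rather than a guarantee: the stitched model should predict, from the noisy latent x_t alone, the reward that the fully denoised sample would receive.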
Diffusion alignment typically requires estimating how good a noisy latent will be once fully denoised. Existing methods solve this problem either with biased Tweedie approximations or with expensive Monte Carlo sampling. StitchVM sidesteps that trade-off by learning the value function for noisy latents once, then amortizing it across many samples and training iterations. The resulting value model retains the robust reward signal of the original pixel-space model while inheriting the diffusion backbone's native ability to handle noise. In downstream experiments, DPS (Diffusion Posterior Sampling) ran 3.2× faster and used half the peak GPU memory when guided by StitchVM, and DiffusionNFT saw a 2.3× speedup.
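The trade-off between the two existing estimators can be seen in a one-dimensional toy problem. The sketch below is not StitchVM code; it assumes a standard normal prior on the clean sample, a noising step x_t = a·x0 + s·ε, and a hypothetical reward r(x0) = x0². Under those assumptions the posterior is Gaussian, so the exact value is known and both estimators can be checked against it. In a real diffusion model each Monte Carlo sample would require a full denoising rollout rather than a single Gaussian draw.

```python
import random


def posterior(x_t, a, s):
    """Mean and variance of x0 given x_t, for x0 ~ N(0, 1) and x_t = a*x0 + s*eps."""
    denom = a * a + s * s
    return a * x_t / denom, s * s / denom


def reward(x0):
    # Hypothetical stand-in reward; any nonlinear function shows the same effect.
    return x0 * x0


def true_value(x_t, a, s):
    # E[x0^2 | x_t] = mean^2 + var for a Gaussian posterior.
    mean, var = posterior(x_t, a, s)
    return mean * mean + var


def tweedie_value(x_t, a, s):
    # Plug the posterior mean into the reward: one reward call, but biased
    # whenever the reward is nonlinear.
    mean, _ = posterior(x_t, a, s)
    return reward(mean)


def monte_carlo_value(x_t, a, s, num_samples, rng):
    # Average the reward over posterior samples: unbiased, but it costs
    # num_samples reward calls (and, in practice, num_samples rollouts).
    mean, var = posterior(x_t, a, s)
    std = var ** 0.5
    total = 0.0
    for _ in range(num_samples):
        total += reward(rng.gauss(mean, std))
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    x_t, a, s = 0.5, 0.6, 0.8
    print("true value :", true_value(x_t, a, s))
    print("Tweedie    :", tweedie_value(x_t, a, s))
    print("Monte Carlo:", monte_carlo_value(x_t, a, s, 10_000, rng))
```

With these settings the posterior mean is 0.3 and the variance is 0.64, so the exact value is about 0.73 while the plug-in estimate is about 0.09: the gap equals the posterior variance. The Monte Carlo average converges to the exact value only as the sample count grows, which is the cost a learned value model is meant to amortize.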
The authors position StitchVM as a general recipe for lifting any pretrained reward model (aesthetic scorers, prompt-fidelity classifiers, or task-specific detectors) into the latent space of any diffusion architecture. Because the diffusion backbone stays frozen during stitching, the method scales to large models without retraining from scratch. Open questions remain about wall-clock timing for the full alignment pipeline, performance against recent Direct Preference Optimization variants, and whether the 10 GPU-hour stitching cost holds for heavier backbones like FLUX or Hunyuan.
