Realiz3D decouples visual domain from 3D control to fix synthetic render artifacts
A new training framework separates visual domain from geometric control signals, letting diffusion models generate photorealistic, 3D-consistent images without the synthetic render artifacts that plague current fine-tuning approaches.

Fine-tuning image generators on synthetic 3D renders teaches them geometry and viewpoint control, but the output often looks like a render: flat lighting, plastic materials, a telltale CG sheen. Realiz3D, a preprint from researchers at Google and the University of Oxford, argues that the problem isn't the synthetic training data itself but an unintended correlation: the model learns to associate the control signals with synthetic appearance.
The framework explicitly separates visual domain (real vs. synthetic) from the other control inputs. A learned domain covariate, fed into small residual adapters, shifts the output between domains, while the main model learns geometry, material, and viewpoint controls independently. At inference, the covariate can be set to "real," letting the generator apply 3D controls without defaulting to the synthetic look baked into the training renders.
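The preprint's exact architecture isn't reproduced here, but the idea translates naturally into code. Below is a minimal PyTorch sketch of a residual adapter conditioned on a learned domain embedding; the name `DomainResidualAdapter`, the layer sizes, the SiLU nonlinearity, and the zero-initialization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainResidualAdapter(nn.Module):
    """Sketch of a small residual adapter that shifts features toward a
    target visual domain via a learned per-domain embedding. Hypothetical;
    not the authors' code."""

    def __init__(self, dim: int, num_domains: int = 2, hidden: int = 64):
        super().__init__()
        # One learned embedding per domain (e.g., 0 = real, 1 = synthetic).
        self.domain_embed = nn.Embedding(num_domains, hidden)
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        # Zero-init the up projection so the adapter starts as an identity
        # map and the frozen base model's behavior is preserved at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from a diffusion block;
        # domain: (batch,) integer domain labels.
        shift = self.down(x) + self.domain_embed(domain)[:, None, :]
        return x + self.up(F.silu(shift))

# Training sees both domains; at inference the label is pinned to "real".
adapter = DomainResidualAdapter(dim=320)
feats = torch.randn(4, 77, 320)
real = torch.zeros(4, dtype=torch.long)  # domain 0 = real
out = adapter(feats, real)               # same shape as feats
```

Because the base model stays frozen and the adapter only adds a residual, the geometry, material, and viewpoint controls learned elsewhere are untouched when the domain label flips.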
The authors also identify which diffusion layers and denoising timesteps matter most for domain transfer. Early denoising steps and deeper network layers carry more domain-specific information; the framework uses that insight to weight training losses and guide sampling. Realiz3D is demonstrated on text-to-multiview generation and 3D texturing tasks, producing outputs that are both 3D-consistent and photorealistic, a combination that current methods struggle to deliver.
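The summary doesn't give the paper's actual weighting schedule, so the following is only a sketch of the mechanism: up-weighting the loss at early denoising steps, assuming the DDPM convention where large t means high noise. The sigmoid shape, the `sharpness` parameter, and the function name `domain_loss_weight` are all invented for illustration; an analogous per-layer weight could gate where adapters are applied.

```python
import torch

def domain_loss_weight(t: torch.Tensor, num_steps: int = 1000,
                       sharpness: float = 5.0) -> torch.Tensor:
    """Hypothetical schedule that up-weights early denoising steps
    (large t = high noise), where domain cues are claimed to be strongest."""
    frac = t.float() / num_steps  # 0 = fully denoised, ~1 = pure noise
    return torch.sigmoid(sharpness * (frac - 0.5))

t = torch.randint(0, 1000, (8,))   # sampled training timesteps
w = domain_loss_weight(t)          # per-sample weights in (0, 1)
# weighted_loss = (w * per_sample_denoising_mse).mean()
```

The same weights could in principle bias sampling, spending more guidance toward the "real" domain token at the high-noise steps where appearance is being decided.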