OcclusionFormer tackles overlapping objects in layout-to-image generation with explicit Z-order modeling and a new dataset
Researchers introduce OcclusionFormer, a Diffusion Transformer that explicitly models occlusion priority, alongside SA-Z, a dataset of more than 50,000 images with pixel-level depth ordering. The goal is to eliminate texture entanglement when bounding boxes overlap.
Layout-to-image models have made strides in spatial control, but they falter when objects overlap. Without explicit occlusion information, bounding-box intersections produce entangled textures and physically inconsistent layering. OcclusionFormer, a new Diffusion Transformer framework from researchers Ziye Li and Henghui Ding, tackles this by modeling Z-order priority directly. The approach decouples instances and composites them via volume rendering, enforcing correct depth relationships from the start.
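To make the compositing idea concrete, here is a minimal sketch of front-to-back, volume-rendering-style blending for a single pixel. It is not the authors' implementation: the function name `composite_front_to_back`, the `(z_order, alpha, color)` layer format, and the convention that a smaller Z-order value means closer to the viewer are all assumptions made for illustration. Each instance contributes its color weighted by its own opacity and by the transmittance left over from the instances in front of it.

```python
from typing import List, Sequence, Tuple

# One layer = (z_order, alpha, color). Hypothetical format for this sketch:
# smaller z_order means closer to the viewer, alpha is in [0, 1].
Layer = Tuple[int, float, Sequence[float]]


def composite_front_to_back(
    layers: Sequence[Layer],
    background: Sequence[float],
) -> List[float]:
    """Blend per-instance colors at one pixel, nearest instance first."""
    out = [0.0] * len(background)
    transmittance = 1.0  # fraction of the pixel not yet covered

    # Visit instances from nearest to farthest.
    for _, alpha, color in sorted(layers, key=lambda layer: layer[0]):
        weight = transmittance * alpha  # contribution of this instance
        for c in range(len(out)):
            out[c] += weight * color[c]
        transmittance *= 1.0 - alpha  # nearer instances occlude farther ones

    # Whatever remains uncovered shows the background.
    for c in range(len(out)):
        out[c] += transmittance * background[c]
    return out
```

With a fully opaque red instance at Z-order 0 and a fully opaque blue instance at Z-order 1, the pixel comes out pure red regardless of the order in which the layers are listed, which is the layering behavior that explicit occlusion priority is meant to guarantee.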
The framework is paired with SA-Z, a dataset of over 50,000 images annotated with occlusion ordering and pixel-level depth masks, which fills a gap in existing layout-to-image training data. OcclusionFormer also introduces a queried alignment loss that supervises each instance separately, improving semantic consistency and spatial precision. According to the authors, tests on standard layout-to-image benchmarks show improved FID and alignment scores in overlapping scenarios, with cleaner boundaries and more plausible layering than LayoutDiffusion, GLIGEN, and other recent baselines. They also report that the method generalizes to arbitrary numbers of overlapping objects without scene-specific tuning.
Code and dataset access remain pending institutional review, but the preprint is live on arXiv. Two questions remain open: whether SA-Z becomes broadly available, and whether the queried alignment loss transfers to other layout-grounded generation tasks such as video, 3D scenes, or real-time interactive tools, where occlusion handling is equally critical.
