Sat3DGen cuts satellite-to-street 3D geometry error to 5.20m RMSE
A new geometry-first method from Qian et al. generates street-level 3D scenes from single satellite images, improving geometric accuracy by 23% and more than halving the photorealism error (FID) relative to the prior leading approach.

Sat3DGen is a geometry-first 3D scene generation method that reconstructs street-level environments from single satellite images. Researchers Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, and Zeran Ke released the code and preprint this week on HuggingFace Papers.

The method tackles the extreme viewpoint gap between overhead satellite captures and ground-level street views, a problem that has forced prior techniques to choose between geometric fidelity and semantic richness. Geometry-colorization models produce accurate building shapes but lack diversity; proxy-based feed-forward frameworks generate holistic scenes with varied content but suffer from coarse, unstable geometry. Sat3DGen integrates novel geometric constraints and a perspective-view training strategy into the feed-forward paradigm, explicitly addressing the sparse and inconsistent supervision inherent in satellite-to-street data.
What stands out
- Geometric accuracy: On a new benchmark pairing the VIGOR-OOD test set with high-resolution Digital Surface Model (DSM) data, Sat3DGen achieves 5.20m root-mean-square error (RMSE) versus 6.76m for prior methods, a 23% improvement (a minimal RMSE sketch follows this list).
- Photorealism: The geometry-first approach reduces Fréchet Inception Distance (FID) from approximately 40 for the leading baseline, Sat2Density++, to 19, without any dedicated image-quality modules (see the FID example below).
- Downstream applications: The high-quality 3D assets enable semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image DSM estimation.
- Training strategy: The perspective-view training explicitly counters the primary sources of geometric error, the viewpoint gap and sparse supervision, by enforcing consistency across synthesized ground-level views during optimization; an illustrative consistency term is sketched below.
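
The headline RMSE number is a straightforward per-pixel height comparison against the reference DSM. Below is a minimal sketch of how such a metric is computed; the function name `dsm_rmse` and the NaN-masking convention are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dsm_rmse(pred_heights: np.ndarray, gt_heights: np.ndarray) -> float:
    """Root-mean-square error between predicted and reference DSM heights, in meters.

    Both arrays hold per-pixel surface heights on the same grid; NaNs mark
    pixels without valid reference data and are excluded. (Illustrative
    convention, not taken from the paper.)
    """
    valid = ~np.isnan(gt_heights)
    diff = pred_heights[valid] - gt_heights[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

# Example: a uniform 2 m height error yields a 2 m RMSE.
pred = np.full((256, 256), 12.0)
gt = np.full((256, 256), 10.0)
print(dsm_rmse(pred, gt))  # 2.0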
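The FID figures compare Inception-v3 feature statistics between real and generated street views; lower is better. Here is a hedged example using the off-the-shelf `FrechetInceptionDistance` metric from `torchmetrics` (requires the `torchmetrics[image]` extra). The random tensors merely demonstrate the API; a meaningful FID score requires thousands of real and generated images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID measures the distance between Inception-v3 feature distributions of
# real and generated images. Inputs are uint8 tensors of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)

# Placeholder batches; in a real evaluation these would be street-view photos
# and the model's synthesized ground-level renderings.
real_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # scalar FID score; lower is better
```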
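The article does not spell out the paper's exact training objective, but cross-view consistency losses of the following shape are a common way to penalize geometry that disagrees across viewpoints. Everything here, including the hypothetical `warp_fn` reprojection helper, is an illustrative assumption rather than Sat3DGen's actual loss.

```python
import torch
import torch.nn.functional as F

def multiview_consistency_loss(depths, poses, intrinsics, warp_fn):
    """Illustrative cross-view depth-consistency term (NOT the paper's loss).

    depths:     list of predicted depth maps, one per synthesized ground view
    poses:      camera pose for each view
    intrinsics: shared camera intrinsics
    warp_fn:    hypothetical helper that reprojects a source-view depth map
                into a target view given both poses and the intrinsics
    """
    loss = torch.tensor(0.0)
    n = len(depths)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Geometry predicted in view i, reprojected into view j, should
            # agree with the depth predicted directly in view j.
            warped = warp_fn(depths[i], poses[i], poses[j], intrinsics)
            loss = loss + F.l1_loss(warped, depths[j])
    return loss / (n * (n - 1))
```

Averaging over all ordered view pairs keeps the term's scale independent of how many ground-level views are synthesized per satellite image.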