DepthVLM adds native 3D geometry to vision-language models
A new framework transforms vision-language models into native dense geometry predictors by adding a lightweight depth head and training under unified vision-text supervision, outperforming both existing VLMs and pure vision models.

DepthVLM is a framework from researchers at Zhejiang University that transforms a single vision-language model into a dense geometry predictor without sacrificing multimodal capability. The system attaches a lightweight depth head to the LLM backbone and trains under a two-stage schedule, generating full-resolution depth maps alongside language outputs in a single forward pass.
Vision-language models typically excel at 2D tasks such as grounding and captioning but struggle with 3D understanding. The authors trace this to text-only supervision, which under-constrains fine-grained visual perception and prevents the model from recovering dense geometry. Prior approaches either distill geometry from external vision models, which introduces error accumulation, or predict depth directly through inefficient per-pixel queries or coarse token-level outputs.
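The article does not describe the internals of the depth head, so the following is only a rough sketch of the general idea: take the hidden states of the visual tokens, map each one to a positive depth value, and expand the resulting token grid to image resolution. Everything here is an assumption made for illustration, including the function names, the single linear projection, the softplus activation, and the bilinear interpolation, which stands in for whatever learned upsampling the actual model uses. It is written in plain Python to stay dependency-free.

```python
import math

def token_depth(tokens, weights, bias):
    """Project each token's hidden state to one positive depth value (hypothetical head)."""
    depths = []
    for hidden in tokens:
        z = sum(w * x for w, x in zip(weights, hidden)) + bias
        # Numerically stable softplus keeps the predicted depth strictly positive.
        depths.append(max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return depths

def to_grid(values, rows, cols):
    """Reshape a flat, row-major token sequence into a rows x cols grid."""
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2D grid to out_h x out_w with corner-aligned bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        ty = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            tx = fx - x0
            top = grid[y0][x0] * (1 - tx) + grid[y0][x1] * tx
            bottom = grid[y1][x0] * (1 - tx) + grid[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out

# Toy input: a 2x2 grid of visual tokens, each with a 3-dimensional hidden state.
tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
coarse = to_grid(token_depth(tokens, weights=[0.5, -0.25, 1.0], bias=0.1), 2, 2)
depth_map = bilinear_upsample(coarse, out_h=4, out_w=4)  # 4x4 "full-resolution" map
```

Because the head reads hidden states the backbone already computes, a design along these lines would need no per-pixel queries and no extra decoding steps, which is consistent with the single-forward-pass claim. Plain interpolation on its own would only smooth a coarse token-level map, so a real head presumably relies on learned layers to recover fine detail.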
Benchmark results
The team also introduces a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. In the authors' experiments, DepthVLM significantly outperforms existing VLMs while running inference more efficiently, surpasses leading pure vision models on depth estimation, and improves performance on complex 3D spatial reasoning tasks. The framework preserves the model's original multimodal capabilities while adding native geometry prediction.
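The article does not say which metrics the benchmark reports. As an illustration of how metric depth predictions are commonly scored, the sketch below computes two standard measures from the depth estimation literature: absolute relative error (AbsRel, lower is better) and threshold accuracy (the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25, higher is better). Treating these as the benchmark's metrics is an assumption, and the function names are hypothetical.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels with a usable ground-truth depth (gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground-truth depth")
    return pairs

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    pairs = _valid_pairs(pred, gt)
    # A non-positive prediction can never satisfy the ratio test, so it counts as a miss.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)

# Toy flattened depth maps in meters; the last pixel has no ground truth and is skipped.
pred = [2.0, 4.4, 10.0, 1.0]
gt = [2.0, 4.0, 5.0, 0.0]
print(abs_rel(pred, gt), delta_accuracy(pred, gt))
```

Both measures compare predictions to ground truth in absolute units, so they only make sense for metric depth, where the model must get the scale right rather than just the relative ordering of near and far.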
All code and checkpoints will be publicly released. The work is authored by Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, and Lei Ke.