PanoWorld processes 360° panoramas as unified spaces, not fragmented crops
Researchers introduce PanoWorld, a multimodal model with Spherical Spatial Cross-Attention that processes equirectangular panoramas natively for navigation and 3D scene understanding.

PanoWorld is a multimodal large language model that processes 360-degree panoramas as continuous, observer-centered spaces rather than decomposing them into multiple perspective views. The model injects spherical geometry directly into the visual stream through Spherical Spatial Cross-Attention, enabling what the authors call "pano-native understanding": reasoning over equirectangular projection (ERP) panoramas without losing the spherical structure that existing MLLMs discard. This addresses a core limitation: most multimodal models inherit the narrow field of view of human-like perception, so they struggle with spatial tasks such as navigation, robotic search, and 3D scene understanding.
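The preprint's exact formulation of Spherical Spatial Cross-Attention is not reproduced here, but the core idea can be sketched: visual tokens attend to embeddings of their own longitude and latitude on the sphere, so attention becomes aware of ERP geometry. The PyTorch sketch below is illustrative only; the Fourier positional encoding, head count, and residual wiring are assumptions, not the authors' design.

```python
# Minimal sketch: visual tokens (queries) cross-attend to embeddings of
# their spherical coordinates (keys/values). Hyperparameters are assumed.
import math
import torch
import torch.nn as nn


def erp_spherical_coords(h: int, w: int) -> torch.Tensor:
    """Longitude/latitude (radians) at the center of each ERP patch."""
    lon = (torch.arange(w) + 0.5) / w * 2 * math.pi - math.pi   # [-pi, pi)
    lat = math.pi / 2 - (torch.arange(h) + 0.5) / h * math.pi   # [pi/2, -pi/2]
    lat_grid, lon_grid = torch.meshgrid(lat, lon, indexing="ij")
    return torch.stack([lon_grid, lat_grid], dim=-1).reshape(-1, 2)  # (H*W, 2)


class SphericalSpatialCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_freqs: int = 16):
        super().__init__()
        # Fourier frequency bands for encoding the two angles.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        self.geo_proj = nn.Linear(2 * 2 * num_freqs, dim)  # sin/cos x lon/lat
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, H*W, dim) patch features from the visual encoder.
        coords = erp_spherical_coords(h, w).to(tokens.device)     # (H*W, 2)
        ang = coords[:, :, None] * self.freqs                     # (H*W, 2, F)
        geo = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        geo = self.geo_proj(geo).unsqueeze(0).repeat(tokens.size(0), 1, 1)
        out, _ = self.attn(query=tokens, key=geo, value=geo)
        return self.norm(tokens + out)                            # residual


# Example: a 16x32 grid of ERP patch tokens.
layer = SphericalSpatialCrossAttention(dim=256)
x = torch.randn(2, 16 * 32, 256)
print(layer(x, h=16, w=32).shape)  # torch.Size([2, 512, 256])
```

One consequence of this design is that the geometry embeddings depend only on the patch grid, so they can be precomputed once per input resolution.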
The research team built a large-scale metadata pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision. They defined four key abilities for pano-native understanding: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. To evaluate these capabilities, they constructed PanoSpace-Bench, a diagnostic benchmark for ERP-native spatial reasoning. On PanoSpace-Bench, H* Bench, and the R2R-CE Val-Unseen navigation benchmark, PanoWorld substantially outperformed both proprietary and open-source baselines, suggesting that robust panoramic reasoning requires dedicated pano-native supervision rather than perspective-view decomposition. The preprint, posted in May 2026, states that all source code and proposed data will be publicly released.
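To see why ERP panoramas lend themselves to depth-aware 3D supervision, note that every ERP pixel corresponds to a direction on the unit sphere, so a depth map turns the panorama into an observer-centered point cloud. The sketch below illustrates this unprojection under assumed conventions (Euclidean depth, x-forward/z-up axes); it is not the paper's pipeline code.

```python
# Minimal sketch: unproject an ERP depth map to observer-centered 3D points.
# The depth convention (Euclidean range) and axis layout are assumptions.
import math
import numpy as np


def erp_to_points(depth: np.ndarray) -> np.ndarray:
    """Map an ERP depth map (H, W) to 3D points (H, W, 3); x forward, z up."""
    h, w = depth.shape
    lon = (np.arange(w) + 0.5) / w * 2 * math.pi - math.pi   # [-pi, pi)
    lat = math.pi / 2 - (np.arange(h) + 0.5) / h * math.pi   # [pi/2, -pi/2]
    lat_g, lon_g = np.meshgrid(lat, lon, indexing="ij")
    # Unit ray direction for every pixel on the sphere.
    dirs = np.stack(
        [np.cos(lat_g) * np.cos(lon_g),   # x
         np.cos(lat_g) * np.sin(lon_g),   # y
         np.sin(lat_g)],                  # z
        axis=-1,
    )
    # Scale each unit direction by its depth to get a 3D point.
    return dirs * depth[..., None]


# Example: a synthetic scene at a constant 2 m range everywhere.
points = erp_to_points(np.full((256, 512), 2.0))
print(points.shape, np.linalg.norm(points, axis=-1).max())  # (256, 512, 3) ~2.0
```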