Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: although they can accurately recognize and describe object locations in images, they often fail to faithfully reproduce the same spatial relations during text-to-image generation, even though such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored to remote sensing that explicitly addresses this spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning, which transforms textual instructions into spatial layout plans and thereby decouples geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward the spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation, which exposes the model to systematic, geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation while maintaining strong performance on multimodal understanding tasks such as image captioning, visual grounding, and VQA.
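To make the decoupling behind Spatial-Layout Planning concrete, the following is a minimal, purely illustrative Python sketch: an instruction with an explicit spatial relation is first turned into a layout plan (object labels with normalized bounding boxes), and only that plan would be handed to the synthesis stage. All names (`LayoutItem`, `plan_layout`) and the rule-based parsing are hypothetical placeholders; the abstract does not specify Uni-RS's actual interface, and in the real system the plan would presumably be produced by the multimodal model itself rather than by hand-written rules.

```python
# Hypothetical sketch of Spatial-Layout Planning: map a textual instruction
# to an explicit layout plan BEFORE any pixels are synthesized, so geometry
# is planned separately from visual synthesis. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class LayoutItem:
    label: str                               # object category, e.g. "airplane"
    box: tuple[float, float, float, float]   # normalized (x0, y0, x1, y1)


def plan_layout(instruction: str) -> list[LayoutItem]:
    """Toy rule-based planner handling one relation ("left of").

    A real planner would cover the full relation vocabulary and be learned;
    this stub only demonstrates the text -> layout-plan decoupling.
    """
    if " to the left of " in instruction:
        subj, ref = instruction.split(" to the left of ")
        subj = subj.removeprefix("an ").removeprefix("a ")
        ref = ref.removeprefix("an ").removeprefix("a ")
        return [
            LayoutItem(subj, (0.05, 0.35, 0.45, 0.65)),  # subject in left half
            LayoutItem(ref, (0.55, 0.35, 0.95, 0.65)),   # reference in right half
        ]
    return []  # no recognized spatial relation: fall back to unconstrained layout


if __name__ == "__main__":
    for item in plan_layout("an airplane to the left of a storage tank"):
        print(item)
```

Under this framing, the generator is conditioned on the layout plan rather than on raw text alone, so spatial faithfulness reduces to honoring boxes instead of re-deriving geometry from language at synthesis time.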