Despite notable advances in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we focus on vehicle imagery captured by drones and introduce AirSpatial, a spatially aware dataset comprising over 206K instructions and two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3D bounding boxes (3DBB). To effectively extend the existing image-understanding capabilities of VLMs to spatial domains, we adopt a two-stage training strategy consisting of Image Understanding Pre-training and Spatial Understanding Fine-tuning. Building on this spatially aware VLM, we develop an aerial agent, AirSpatialBot, capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, reveal the spatial limitations of existing VLMs, and provide valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot.