This paper develops VHM, a Versatile and Honest vision language Model for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, whose captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and to perform diverse remote sensing tasks. Moreover, unlike existing remote sensing instruction datasets that include only factual questions, HnstD contains additional deceptive questions about non-existent objects. This feature prevents VHM from giving affirmative answers to nonsensical queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on the common tasks of scene classification, visual question answering, and visual grounding. It also achieves competent performance on several previously unexplored tasks, such as building vectorization, multi-label classification, and honest question answering. We will release the code, data, and model weights at https://github.com/opendatalab/VHM.