This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, unlike existing remote sensing instruction datasets that include only factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on the common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several previously unexplored tasks, such as building vectorization, multi-label classification, and honest question answering. We will release the code, data, and model weights at https://github.com/opendatalab/VHM.