H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model

Generic large Vision-Language Models (VLMs) are developing rapidly, but they still perform poorly in the Remote Sensing (RS) domain because of the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote Sensing-specific Vision-Language Models (RSVLMs) still have considerable room for improvement, primarily owing to the lack of large-scale, high-quality RS vision-language datasets. We constructed HqDC-1.4M, a large-scale set of High-quality and Detailed Captions for RS images containing 1.4 million image-caption pairs, which not only enhances the RSVLM's understanding of RS images but also significantly improves the model's spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable "hallucination" problem in RSVLMs, we developed RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness and reduces the hallucinations of the model's outputs, thereby enhancing the honesty of the RSVLM. Based on these datasets, we propose H2RSVLM, the Helpful and Honest Remote Sensing Vision-Language Model. H2RSVLM achieves outstanding performance on multiple public RS datasets and is capable of recognizing and refusing to answer unanswerable questions, effectively mitigating incorrect generations. We will release the code, data, and model weights at https://github.com/opendatalab/H2RSVLM.

