Generic large Vision-Language Models (VLMs) are developing rapidly but still perform poorly in the Remote Sensing (RS) domain, owing to the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote Sensing-specific Vision-Language Models (RSVLMs) still have considerable room for improvement, primarily because large-scale, high-quality RS vision-language datasets are lacking. We constructed HqDC-1.4M, a large-scale dataset of High-quality and Detailed Captions for RS images containing 1.4 million image-caption pairs, which not only enhances an RSVLM's understanding of RS images but also significantly improves its spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable "hallucination" problem in RSVLMs, we developed RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness of the model's outputs and reduces hallucinations, thereby enhancing the honesty of the RSVLM. Based on these datasets, we proposed H2RSVLM, the Helpful and Honest Remote Sensing Vision-Language Model. H2RSVLM has achieved outstanding performance on multiple public RS datasets and is capable of recognizing and refusing to answer unanswerable questions, effectively mitigating incorrect generations. We will release the code, data, and model weights at https://github.com/opendatalab/H2RSVLM .