RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation

Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, i.e., responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies (e.g., in modality, resolution, and scene-level semantics) that go beyond object-centric errors; (2) we build a hallucination benchmark, RSHalluEval (2,023 QA pairs), and enable dual-mode checking: high-precision cloud auditing, and low-cost, reproducible local checking via a compact checker fine-tuned on the RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset, RSHalluShield (30k QA pairs), for training-friendly mitigation, and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
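The headline result is reported in percentage points of hallucination-free rate. The abstract does not spell out how that rate is computed, so the following is only a minimal sketch under an assumed definition: each response receives a boolean verdict from a checker (cloud or local), and the rate is the share of responses judged hallucination-free. The function names and the toy verdict lists are hypothetical and are not taken from the RSHallu code or data.

```python
from typing import Sequence


def hallucination_free_rate(verdicts: Sequence[bool]) -> float:
    """Percentage of responses judged hallucination-free.

    Assumption: `verdicts[i]` is True when the checker found no
    hallucination in response i. This definition is illustrative,
    not necessarily the one used in the paper.
    """
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return 100.0 * sum(verdicts) / len(verdicts)


def percentage_point_gain(baseline: Sequence[bool],
                          mitigated: Sequence[bool]) -> float:
    """Absolute difference between two rates, in percentage points."""
    return hallucination_free_rate(mitigated) - hallucination_free_rate(baseline)


# Toy verdicts, invented for illustration only.
baseline = [True, False, False, True, False]
mitigated = [True, True, False, True, True]

print(hallucination_free_rate(baseline))
print(hallucination_free_rate(mitigated))
print(percentage_point_gain(baseline, mitigated))
```

A gain in percentage points is an absolute difference between two rates, not a relative improvement: moving from 40% to 80% is a 40-point gain, even though the rate itself doubled.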
