While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness": existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios and structures change interpretation into four progressive cognitive dimensions. We then propose Delta-LLaVA, an MLLM framework tailored to multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core components: a Change-Enhanced Attention module that isolates and amplifies visual differences; a Change-SEG module that uses Change Prior Embedding to extract differentiable difference features as input to the LLM; and Local Causal Attention, which prevents cross-temporal contextual leakage. Extensive experiments show that Delta-LLaVA outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for Earth observation intelligence.
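The abstract does not specify how Local Causal Attention is constructed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that visual tokens from each temporal image attend causally within their own image only, while text tokens attend causally to everything before them. The function name `local_causal_mask`, the `TEXT` marker, and the `segment_ids` convention are all hypothetical.

```python
from typing import List

TEXT = -1  # hypothetical marker for text tokens; values >= 0 index a temporal image


def local_causal_mask(segment_ids: List[int]) -> List[List[bool]]:
    """Build a boolean attention mask; mask[i][j] is True if token i may attend to token j.

    Assumed rule (not taken from the paper):
      * no token attends to a later position (causal);
      * a visual token attends only to tokens of its own temporal image;
      * a text token attends to every earlier token, visual or text.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # j <= i enforces causality
            is_text_query = segment_ids[i] == TEXT
            same_segment = segment_ids[i] == segment_ids[j]
            mask[i][j] = is_text_query or same_segment
    return mask


# Two tokens from image T1, two from image T2, then one text token.
example = local_causal_mask([0, 0, 1, 1, TEXT])
```

In this example, the T2 tokens cannot see the T1 tokens, which is one way to keep per-image context from leaking across time steps, while the trailing text token still sees both images. A real model would typically convert such a mask to a tensor and pass it to its attention layers.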