Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, in which candidate regions are delineated class-agnostically and supervision replaces lexical class names with anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation rather than on unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs (MLLMs), and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.