Segmenting agricultural landscapes in the Global South is challenging because they are characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Multimodal Large Language Models (MLLMs) have recently advanced segmentation, but current approaches face two limitations on satellite imagery: critical context-length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations with MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is architecturally efficient: it enables standard MLLMs to segment complex smallholder agricultural landscapes from high-resolution satellite imagery without auxiliary vision decoders. We also introduce a novel instruction-tuning data format for scalable fine-tuning and post-training on high-resolution satellite imagery, which lets MAgSeg learn from the global context of an image while generating text tokens for only a single patch within it. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution for mapping smallholder agricultural environments.
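The abstract does not specify the instruction-tuning data format, so the following is only a minimal illustrative sketch of the general idea it describes: each training sample references the full image for context but asks for text output covering a single patch. Everything concrete here is an assumption made for illustration and not MAgSeg's actual format: the label mask as a grid of integer class ids, the `<image>` placeholder, the prompt wording, the non-overlapping square patches, and the space-separated serialization.

```python
from typing import Dict, List

Mask = List[List[int]]  # hypothetical label mask: one integer class id per pixel


def patch_to_text(mask: Mask, top: int, left: int, size: int) -> str:
    """Serialize one size x size patch of the mask as rows of class ids."""
    rows = []
    for r in range(top, top + size):
        rows.append(" ".join(str(mask[r][c]) for c in range(left, left + size)))
    return "\n".join(rows)


def text_to_patch(text: str) -> Mask:
    """Parse the serialized text back into a grid of class ids."""
    return [[int(tok) for tok in line.split()] for line in text.splitlines()]


def build_samples(mask: Mask, size: int, image_ref: str = "<image>") -> List[Dict[str, str]]:
    """Create one (prompt, response) pair per non-overlapping patch.

    Every prompt points at the same full image (global context), while the
    response covers only the requested patch, which keeps the number of
    generated tokens per sample bounded by size * size labels.
    """
    height, width = len(mask), len(mask[0])
    samples = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            # Hypothetical prompt wording; the real instruction text is not given.
            prompt = (
                f"{image_ref}\n"
                f"Segment the patch with top-left corner (row={top}, col={left}) "
                f"and side {size}."
            )
            samples.append(
                {"prompt": prompt, "response": patch_to_text(mask, top, left, size)}
            )
    return samples
```

Under these assumptions, a 4 x 4 mask with 2 x 2 patches yields four samples, and a full-image prediction would be assembled by parsing each response with `text_to_patch` and placing it at its patch coordinates.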