Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks, but their knowledge and abilities in the geographic and geospatial domains have yet to be explored, despite potentially wide-ranging benefits for navigation, environmental research, urban development, and disaster response. We conduct a series of experiments that probe the visual capabilities of MLLMs in these domains, focusing on the frontier model GPT-4V and benchmarking its performance against open-source counterparts. Our methodology challenges these models with a small-scale geographic benchmark comprising a suite of visual tasks that span a range of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released.