Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks, but their knowledge and abilities in the geographic and geospatial domains have yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring the vision capabilities of MLLMs within these domains, focusing in particular on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology challenges these models with a small-scale geographic benchmark consisting of a suite of visual tasks that test their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released.