Large pre-trained models, also known as foundation models (FMs), are trained in a task-agnostic manner on large-scale data and can be adapted to a wide range of downstream tasks through fine-tuning, few-shot learning, or even zero-shot learning. Despite their successes in language and vision tasks, we have yet to see an attempt to develop foundation models for geospatial artificial intelligence (GeoAI). In this work, we explore the promise and challenges of developing multimodal foundation models for GeoAI. We first investigate the potential of many existing FMs by testing their performance on seven tasks across multiple geospatial subdomains, including Geospatial Semantics, Health Geography, Urban Geography, and Remote Sensing. Our results indicate that on several geospatial tasks that involve only the text modality, such as toponym recognition, location description recognition, and US state-level/county-level dementia time series forecasting, task-agnostic large language models (LLMs) can outperform task-specific fully supervised models in a zero-shot or few-shot learning setting. However, on other geospatial tasks, especially those that involve multiple data modalities (e.g., POI-based urban function classification, street view image-based urban noise intensity classification, and remote sensing image scene classification), existing foundation models still underperform task-specific models. Based on these observations, we propose that one of the major challenges of developing an FM for GeoAI is addressing the multimodal nature of geospatial tasks. After discussing the distinct challenges of each geospatial data modality, we suggest the possibility of a multimodal foundation model that can reason over various types of geospatial data through geospatial alignments. We conclude by discussing the unique risks and challenges of developing such a model for GeoAI.