Generative AI, including large language models (LLMs), has recently attracted significant interest in the geoscience community because of its versatile task-solving capabilities, which include programming, arithmetic reasoning, generation of sample data, time-series forecasting, toponym recognition, and image classification. Most existing performance assessments of LLMs for spatial tasks have focused on ChatGPT, whereas other chatbots have received less attention. To narrow this research gap, this study conducts a zero-shot correctness evaluation of four prominent chatbots (ChatGPT-4, Gemini, Claude-3, and Copilot) on a set of 76 spatial tasks across seven task categories. The chatbots generally performed well on tasks related to spatial literacy, GIS theory, and interpretation of programming code and functions, but revealed weaknesses in mapping, code writing, and spatial reasoning. Furthermore, the correctness of results differed significantly among the four chatbots. When tasks were assigned repeatedly to each chatbot, the responses were highly consistent, with matching rates of over 80% for most task categories across the four chatbots.