Generative AI, including large language models (LLMs), has recently gained significant interest in the geoscience community through its versatile task-solving capabilities, including coding, spatial computations, generation of sample data, time-series forecasting, toponym recognition, and image classification. So far, the assessment of LLMs for spatial tasks has primarily focused on ChatGPT, arguably the most prominent AI chatbot, whereas other chatbots have received less attention. To narrow this research gap, this study evaluates the correctness of responses to a set of 54 spatial tasks assigned to four prominent chatbots: ChatGPT-4, Bard, Claude-2, and Copilot. Overall, the chatbots performed well on spatial literacy, GIS theory, and the interpretation of programming code and given functions, but revealed weaknesses in mapping, code generation, and code translation. ChatGPT-4 outperformed the other chatbots across most task categories.