The advent of large language models such as ChatGPT and Gemini has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses this gap by introducing a novel multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo and gpt-4o, ZhipuAI's glm-4, and Moonshot AI's moonshot-v1-8k, through a two-phase testing approach: we first conducted zero-shot testing, then categorized the dataset by difficulty and performed prompt-tuning tests. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, averaging 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o in place-name recognition tasks. The study also highlights the impact of prompting strategies on model performance in specific tasks. For example, a Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
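The prompting comparison summarized above can be pictured as a simple evaluation loop over the same tasks with different prompt templates. The sketch below is a hypothetical illustration, not the study's actual harness: it assumes the OpenAI Python SDK, a toy task format of question/answer pairs, and made-up zero-shot and CoT templates; only the overall pattern (same model, same tasks, different prompting strategy) mirrors the setup described here.

```python
# Minimal sketch of comparing zero-shot vs. Chain-of-Thought prompting.
# Assumptions (not from the paper): OpenAI Python SDK, a hypothetical
# task format of {"question": ..., "answer": ...} dicts, illustrative templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = "{question}\nAnswer with the option letter only."
COT = "{question}\nThink step by step, then give the option letter on the last line."

def ask(model: str, prompt: str) -> str:
    """Send a single prompt and return the model's text response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def accuracy(model: str, template: str, tasks: list[dict]) -> float:
    """Fraction of tasks whose verified answer string appears in the reply."""
    correct = 0
    for task in tasks:
        reply = ask(model, template.format(question=task["question"]))
        correct += task["answer"] in reply
    return correct / len(tasks)

# Example usage on a hypothetical path-planning subset:
# tasks = [{"question": "...", "answer": "B"}, ...]
# print(accuracy("gpt-4o", ZERO_SHOT, tasks), accuracy("gpt-4o", COT, tasks))
```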