The growth of spatiotemporal data and the demand for efficient geospatial modeling have spurred interest in automating these tasks with large language models (LLMs). General-purpose LLMs, however, often produce erroneous geospatial code because they lack domain-specific knowledge of geospatial functions and operators. To address this, we propose a retrieval-augmented generation (RAG) approach that draws on an external knowledge base of geospatial functions and operators. This study introduces a framework for constructing such a knowledge base from the semantics of geospatial scripts. The framework has three components: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Semantic Mapping (Geo-FuM). Chain-of-Thought prompting, TF-IDF, and the Apriori algorithm are used to derive and align geospatial functions. An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub. In evaluation, Geo-FuB reaches an overall accuracy of 88.89%, with structural and semantic accuracies of 92.03% and 86.79%, respectively. We also highlight the potential of Geo-FuB to improve geospatial code generation through both the RAG and fine-tuning paradigms.
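To make the idea of frequent operator combination statistics more concrete, the following is a minimal Apriori-style sketch in plain Python. It is not the authors' Geo-FuST implementation. The toy `scripts` data, the `frequent_itemsets` helper, and the `min_support` and `max_size` parameters are all assumptions introduced for illustration; the sketch only assumes that each script has already been reduced to the set of operators it calls.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each script reduced to the set of operators it calls.
scripts = [
    {"ee.ImageCollection", "filterDate", "filterBounds", "median"},
    {"ee.ImageCollection", "filterDate", "median", "clip"},
    {"ee.ImageCollection", "filterDate", "filterBounds", "clip"},
    {"ee.Image", "normalizedDifference", "clip"},
]


def frequent_itemsets(transactions, min_support, max_size=3):
    """Apriori-style level-wise search.

    Returns a dict mapping each frequent operator set (a frozenset)
    to the number of scripts that contain it.
    """
    frequent = {}
    # Level 1: count individual operators and keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    size = 1
    while current and size <= max_size:
        frequent.update(current)
        size += 1
        # Candidate generation: combine operators that survived the last level.
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, size)]
        # Apriori pruning: every (size - 1)-subset must itself be frequent.
        candidates = [
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
        # Support counting for the surviving candidates.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
    return frequent


result = frequent_itemsets(scripts, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With a support threshold of three scripts, the only operator pair that survives in this toy data is `ee.ImageCollection` together with `filterDate`. A knowledge base could record such recurring combinations alongside the individual functions, so that a retriever can surface operators that are typically used together.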