Software development support tools have been studied for a long time, with recent approaches using Large Language Models (LLMs) for code generation. These models can generate Python code for data science and machine learning applications. LLMs are helpful for software engineers because they increase productivity in daily work. An LLM can also serve as a "mentor" for inexperienced software developers and provide viable learning support. High-quality code generation with LLMs can also be beneficial in geospatial data science. However, this domain poses distinct challenges, and code generation LLMs are typically not evaluated on geospatial tasks. Here, we show how we constructed an evaluation benchmark for code generation models based on a selection of geospatial tasks. We categorised geospatial tasks by their complexity and required tools. Then, we created a dataset of tasks that test model capabilities in spatial reasoning, spatial data processing, and geospatial tool usage. The dataset consists of specific coding problems that were manually created to ensure high quality. For every problem, we proposed a set of test scenarios that make it possible to automatically check the generated code for correctness. In addition, we evaluated a selection of existing code generation LLMs on geospatial coding tasks. We share our dataset and reproducible evaluation code in a public GitHub repository, arguing that it can serve as an evaluation benchmark for new LLMs in the future. Our dataset will hopefully contribute to the development of new models capable of solving geospatial coding tasks with high accuracy. Such models would enable the creation of coding assistants tailored for geospatial applications.
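The automatic correctness check described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the problem (a haversine distance function), the function name `haversine_km`, and the tolerances are all hypothetical, chosen only to show how generated code might be executed against predefined test scenarios.

```python
import math

# Hypothetical model output for an illustrative benchmark problem:
# "implement great-circle distance between two points in kilometres".
generated_code = """
import math

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
"""

def check_solution(code, entry_point, scenarios):
    """Execute generated code in a fresh namespace and run each test scenario.

    Each scenario is (arguments, expected value, numeric tolerance);
    a runtime error in the generated code counts as a failed scenario.
    """
    namespace = {}
    exec(code, namespace)
    fn = namespace[entry_point]
    results = []
    for args, expected, tol in scenarios:
        try:
            results.append(abs(fn(*args) - expected) <= tol)
        except Exception:
            results.append(False)
    return results

# Illustrative test scenarios: identical points, and Warsaw-Rome
# (roughly 1317 km great-circle distance, checked with a loose tolerance).
scenarios = [
    ((0.0, 0.0, 0.0, 0.0), 0.0, 1e-9),
    ((52.2297, 21.0122, 41.8919, 12.5113), 1317.0, 10.0),
]

print(check_solution(generated_code, "haversine_km", scenarios))
```

A real harness would additionally sandbox the execution and aggregate pass rates per task, but the core idea, executing model output against scenario-based assertions, is the same.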