Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports replication, verification, and validation? We address this question by evaluating 17 contemporary LLMs on a controlled ODD-to-code translation task, using the PPHPC predator-prey model as a fully specified reference. Generated Python implementations are assessed through staged executability checks, model-independent statistical comparison against a validated NetLogo baseline, and quantitative measures of runtime efficiency and maintainability. Results show that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produces statistically valid and efficient implementations, with Claude 3.7 Sonnet performing well but less reliably. Overall, the findings clarify both the promise and current limitations of LLMs as model engineering tools, with implications for reproducible agent-based and environmental modelling.