Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite this progress, little attention has been paid to date to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper we introduce BLAST, the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.