Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper presents an empirical evaluation of automated ACSL annotation generation strategies for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three large language models (DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct). The study focuses on one-shot annotation generation, assessing how these approaches perform when directly applied to verification tasks. Using a filtered subset of the CASP benchmark, we evaluate generated annotations through Frama-C's WP plugin with multiple SMT solvers, analyzing proof success rates, solver timeouts, and internal processing time. Our results show that rule-based approaches remain more reliable for verification success, while LLM-based methods exhibit more variable performance. These findings highlight both the current limitations and the potential of LLMs as complementary tools for automated specification generation.
翻译:形式化规约对于构建可验证且可靠的软件系统至关重要,然而为现实世界中的C程序生成准确且可验证的规约仍具挑战。本文对C程序的自动化ACSL注释生成策略进行了实证评估,比较了基于规则的Python脚本、Frama-C的RTE插件以及三种大型语言模型(DeepSeek-V3.2、GPT-5.2和OLMo 3.1 32B Instruct)。本研究聚焦于单次注释生成,评估这些方法直接应用于验证任务时的表现。我们利用CASP基准测试的过滤子集,通过Frama-C的WP插件结合多个SMT求解器对生成的注释进行评估,分析了证明成功率、求解器超时时间及内部处理时间。结果表明,基于规则的方法在验证成功方面更为可靠,而基于LLM的方法则表现出更大的性能波动。这些发现既揭示了当前LLM作为自动规约生成补充工具的局限性,也展现了其潜在价值。