Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.
翻译:形式化规约对于构建可验证且可信赖的软件系统至关重要,然而为真实C程序生成准确且可验证的规约仍具挑战性。本文通过实验评估了形式化分析工具在无需人工或基于学习辅助的情况下,自动生成并验证ACSL规约的能力。我们对近期发布的包含506个C程序的数据集开展受控研究,将其从交互式开发人员驱动的工作流程改造为自动化评估环境。比较了五种ACSL生成系统:基于规则的Python脚本、Frama-C的RTE插件,以及三个大语言模型——DeepSeek-V3.2、GPT-5.2和OLMo 3.1 32B Instruct。所有生成的规约均在使用多个SMT求解器驱动的Frama-C WP插件下,于相同条件下验证,从而可直接比较注释质量、求解器敏感度及证明稳定性。我们的结果为自动化ACSL生成的能力与局限性提供了新的实证依据,补充了先前基于调查的研究工作。