The SV-COMP competition provides a state-of-the-art platform for evaluating software verification tools on a standardized set of verification tasks. Consequently, the direction of verifier development is influenced by the composition of the program benchmarks included in SV-COMP. When expanding this benchmark corpus, it is crucial to consider whether newly added programs cause verifiers to exhibit behavior distinct from that observed on existing benchmarks; doing so helps mitigate external threats to the validity of the competition's results. In this paper, we present the application of the ARG-V tool to automatically generate Java verification benchmarks in the SV-COMP format. We demonstrate that, on a newly generated set of 68 realistic benchmarks, all four leading Java verifiers exhibit lower accuracy and recall than on the existing benchmark suite. These findings highlight the potential of ARG-V to enhance the comprehensiveness and realism of verification tool evaluation, while also providing a roadmap for verifier developers aiming to improve their tools' applicability to real-world software.