Software is used in critical applications in our day-to-day lives, and it is important to ensure its correctness. One popular approach to assessing correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass, one may assume that the software is correct. However, the reliability of this conclusion depends on the test suite considered, and there is a risk of false negatives (i.e., software that passes all available tests but contains bugs because some cases are not tested). It is therefore important to consider error-inducing test cases when evaluating software. To support data-driven creation of such test suites, which is of particular interest for testing software synthesized by large language models, we curate Codehacks, a dataset of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). The dataset is collected from the wild, specifically from the Codeforces online judge platform. It comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code of 2,196 submitted solutions that are broken by their corresponding hacks. Keywords: competitive programming, language model, dataset