Current code generation benchmarks measure functional correctness on well-formed inputs, as test cases are curated to satisfy input preconditions. This leaves a gap: generated programs may appear correct but fail to satisfy contracts -- assertion-level validity constraints for rejecting ill-formed inputs. We introduce ContractEval, a benchmark for evaluating contract satisfaction in code generation, i.e., whether generated code rejects contract-violating inputs by triggering the intended assertions. Built on HumanEval+ and MBPP+, ContractEval augments each task with contract-violation tests derived from reference assertions. We synthesize these via a neuro-symbolic pipeline: an LLM converts assertion clauses into constraints, and an SMT solver enumerates satisfiable violation combinations to generate inputs that violate selected clauses while satisfying the rest. Across five code LLMs, standard prompting yields 0% contract satisfaction, whereas adding a few contract-violation examples boosts contract satisfaction to 49--53% while retaining 92% of the original pass@1. Our code is available at https://github.com/suhanmen/ContractEval.
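To make the enumeration step concrete, here is a minimal sketch (not the ContractEval implementation; clause set and helper name are illustrative) of how an SMT solver can enumerate satisfiable violation combinations: each subset of assertion clauses is negated while the remaining clauses are kept, and Z3 is queried for an input witnessing that combination.

```python
# Minimal sketch of SMT-based violation enumeration (illustrative, assumes z3-solver).
from itertools import combinations
from z3 import Int, Solver, And, Not, sat

n = Int("n")
# Example reference assertions for a task: assert n >= 0 and assert n <= 100.
clauses = [n >= 0, n <= 100]

def violation_inputs(clauses):
    """Yield (violated_indices, model) for each satisfiable violation combination."""
    idx = range(len(clauses))
    for r in range(1, len(clauses) + 1):
        for chosen in combinations(idx, r):
            s = Solver()
            # Negate the selected clauses, keep the remaining ones as-is.
            s.add(And(*[Not(clauses[i]) if i in chosen else clauses[i] for i in idx]))
            if s.check() == sat:
                yield chosen, s.model()

for chosen, model in violation_inputs(clauses):
    print(f"violates clauses {chosen}: n = {model[n]}")
```

For the two clauses above, the solver finds inputs violating exactly one clause (e.g., n = -1 or n = 101) and prunes the jointly unsatisfiable combination where both clauses are negated.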