Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high-value MPC fixes lack the standardized tests that rigid extraction pipelines require; and (iii) standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks. Evaluating LLM agents on real repository-level MPC repair instead demands MPC-aware data curation and a verifier matched to the security and numerical-fidelity guarantees that MPC programs must obey, neither of which existing benchmarks provide. We introduce MPC-Patch-Bench, a repository-level benchmark organized around two frameworks. (1) The Data Curation Framework combines a domain-specific curation agent, which filters raw pull requests through three cryptographic layers, with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, yielding 205 fully verified instances. (2) The MPC Verifier provides dedicated security and numerical-fidelity checks via dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally passing patches rejected for cryptographic or numerical-fidelity violations.
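To make the idea of differential testing against a plaintext oracle concrete, the following is a minimal sketch and not the MPC Verifier itself. It assumes a toy two-party additive secret-sharing scheme over a 64-bit ring with 16 fractional bits. Every name in it (`secure_dot_correct`, `secure_dot_buggy`, `differential_test`, the tolerance `tol`) is hypothetical and chosen only for illustration. The "buggy" candidate forgets that multiplying two fixed-point values doubles the scale, the kind of numerical-fidelity error that ordinary functional tests can miss.

```python
import random

RING_BITS = 64
MODULUS = 1 << RING_BITS   # arithmetic is done modulo 2^64
FRAC_BITS = 16             # assumed fixed-point precision


def encode(x, frac_bits=FRAC_BITS):
    """Encode a float as a fixed-point ring element."""
    return round(x * (1 << frac_bits)) % MODULUS


def decode(v, frac_bits=FRAC_BITS):
    """Decode a ring element, treating the upper half of the ring as negative."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / (1 << frac_bits)


def share(v, rng):
    """Split a ring element into two additive shares."""
    r = rng.randrange(MODULUS)
    return r, (v - r) % MODULUS


def _shared_dot(xs, ws, rng):
    """Dot product of private xs with public weights ws, computed on shares."""
    acc0 = acc1 = 0
    for x, w in zip(xs, ws):
        s0, s1 = share(encode(x), rng)
        w_enc = encode(w)                 # public value, known to both parties
        acc0 = (acc0 + s0 * w_enc) % MODULUS
        acc1 = (acc1 + s1 * w_enc) % MODULUS
    return (acc0 + acc1) % MODULUS        # reconstruct the opened result


def secure_dot_correct(xs, ws, rng):
    # A product of two fixed-point values carries twice the fractional bits.
    return decode(_shared_dot(xs, ws, rng), 2 * FRAC_BITS)


def secure_dot_buggy(xs, ws, rng):
    # Hypothetical faulty patch: decodes with the wrong scale.
    return decode(_shared_dot(xs, ws, rng), FRAC_BITS)


def plaintext_oracle(xs, ws):
    """Reference computation on cleartext floats."""
    return sum(x * w for x, w in zip(xs, ws))


def differential_test(candidate, oracle, trials=200, tol=1e-2, seed=0):
    """Return the random inputs on which candidate and oracle disagree beyond tol."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        xs = [rng.uniform(-10, 10) for _ in range(8)]
        ws = [rng.uniform(-10, 10) for _ in range(8)]
        expected = oracle(xs, ws)
        actual = candidate(xs, ws, rng)
        if abs(actual - expected) > tol:
            failures.append((xs, ws, expected, actual))
    return failures


if __name__ == "__main__":
    print("correct patch failures:", len(differential_test(secure_dot_correct, plaintext_oracle)))
    print("buggy patch failures:", len(differential_test(secure_dot_buggy, plaintext_oracle)))
```

In this sketch the tolerance absorbs the rounding error inherent in fixed-point encoding, so a correct candidate passes while a scale error is flagged. A real verifier would presumably need tolerances and input distributions tuned to each framework's precision settings, which the abstract does not specify.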