MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high-value MPC fixes lack the standardized tests that rigid extraction pipelines require; and (iii) standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks. Evaluating LLM agents on real repository-level MPC repair instead demands MPC-aware data curation and a verifier matched to the security and numerical-fidelity guarantees that MPC programs must obey, neither of which existing benchmarks provide. We introduce MPC-Patch-Bench, a repository-level benchmark organized around two frameworks. (1) The Data Curation Framework combines a domain-specific curation agent, which filters raw pull requests through three cryptographic layers, with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, yielding 205 fully verified instances. (2) The MPC Verifier provides dedicated security and numerical-fidelity checks via dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally passing patches rejected for cryptographic or numerical-fidelity violations.
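To make the two kinds of verifier check concrete, the following is a minimal sketch and not the benchmark's actual implementation. It assumes a patched routine can be called like an ordinary Python function and that secret values are opened through a method named `reveal`; the helper names (`differential_check`, `find_reveal_calls`, `fixed_point_mul`) and the 16-bit fixed-point scale are hypothetical choices made for illustration only.

```python
import ast
import math

SCALE = 1 << 16  # assumed fixed-point scale; illustration only


def fixed_point_mul(a, b):
    """Stand-in for a patched routine: fixed-point multiply, truncated once."""
    fa, fb = round(a * SCALE), round(b * SCALE)
    return ((fa * fb) // SCALE) / SCALE


def buggy_fixed_point_mul(a, b):
    """Same routine with the truncation step missing (off by a factor of SCALE)."""
    fa, fb = round(a * SCALE), round(b * SCALE)
    return (fa * fb) / SCALE


def differential_check(candidate, oracle, inputs, rel_tol=1e-3, abs_tol=1e-3):
    """Return every input on which `candidate` disagrees with the plaintext oracle."""
    mismatches = []
    for args in inputs:
        got, want = candidate(*args), oracle(*args)
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((args, got, want))
    return mismatches


def find_reveal_calls(source, method_name="reveal"):
    """Return the line numbers of every `.<method_name>(...)` call in `source`.

    This only locates candidate sites; deciding whether a reveal is actually
    unsafe would need data-flow information that this sketch does not model.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == method_name
        ):
            hits.append(node.lineno)
    return sorted(hits)


def plaintext_mul(a, b):
    """Plaintext oracle: ordinary floating-point multiplication."""
    return a * b


INPUTS = [(1.5, 2.0), (-0.25, 4.0), (3.0, 0.5)]
SAMPLE_SOURCE = "x = secret.reveal()\ny = a + b\nprint(z.reveal())\n"

if __name__ == "__main__":
    print(differential_check(fixed_point_mul, plaintext_mul, INPUTS))
    print(len(differential_check(buggy_fixed_point_mul, plaintext_mul, INPUTS)))
    print(find_reveal_calls(SAMPLE_SOURCE))
```

In this sketch a patch is accepted only if it passes both checks. The buggy variant is deliberately wrong on every sample input, so the dynamic check rejects it outright; a subtler precision error would typically show up only on some inputs, which is why such a check needs a broad input set.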
