Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

Repairing system crashes discovered by kernel fuzzers such as Syzkaller is a critical yet underexplored challenge in software engineering. Recent work has introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, but their evaluation benchmarks are usually static: they do not capture the evolving nature of the Linux kernel, and they risk data contamination, since bugs fixed before an LLM's knowledge cutoff may appear in its training data. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes freshly discovered kernel bugs and evaluates agents on them, and (ii) kEnv, an agent-agnostic, standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavyweight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. We curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents and show that they resolve 74% of crashes on the first attempt (plausible patches); however, only ~20% of generated patches closely match developer fixes. Additionally, exposing crash-resolution feedback improves the crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time- and attribute-sensitive, complete with a public dashboard to track agent progress on Linux kernel bugs.
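The comparison around the knowledge cutoff can be illustrated with a minimal sketch. It assumes each evaluated bug carries the date of its developer fix and a boolean verdict on whether the agent's patch was judged equivalent to that fix. The record type `BugResult`, its field names, the helper functions, and the toy records are all hypothetical and serve only to show the split; they are not the Live-kBench implementation or its data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BugResult:
    """Hypothetical per-bug evaluation record (illustrative only)."""
    bug_id: str        # placeholder identifier, not a real bug ID
    fixed_on: date     # date the developer fix landed
    equivalent: bool   # agent patch judged equivalent to the developer fix


def split_by_cutoff(
    results: List[BugResult], cutoff: date
) -> Tuple[List[BugResult], List[BugResult]]:
    """Partition results into bugs fixed before the cutoff and on/after it."""
    before = [r for r in results if r.fixed_on < cutoff]
    after = [r for r in results if r.fixed_on >= cutoff]
    return before, after


def equivalent_rate(results: List[BugResult]) -> Optional[float]:
    """Fraction of results with an equivalent patch; None for an empty group."""
    if not results:
        return None
    return sum(r.equivalent for r in results) / len(results)


# Toy records, invented purely to exercise the functions above.
toy_results = [
    BugResult("bug-a", date(2023, 1, 10), True),
    BugResult("bug-b", date(2023, 6, 2), False),
    BugResult("bug-c", date(2024, 9, 15), False),
    BugResult("bug-d", date(2025, 2, 1), False),
]
assumed_cutoff = date(2024, 1, 1)  # assumed cutoff date for illustration

before, after = split_by_cutoff(toy_results, assumed_cutoff)
rate_before = equivalent_rate(before)
rate_after = equivalent_rate(after)
```

Reporting the two rates side by side, rather than a single pooled rate, is what makes a contamination effect visible: a large gap in favor of pre-cutoff bugs suggests the model may have seen those fixes during training.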
