The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns that AI-evolved programs can perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles, which takes as input a baseline program $P$ and an AI-evolved program $P'$ and searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types, and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.
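To make the differential setting concrete, the following is a minimal sketch of a differential oracle that runs a baseline $P$ and an evolved $P'$ on the same workload and reports which dimensions regress. It is an illustration under stated assumptions, not AIChilles's implementation: the names `measure`, `differential_oracle`, `baseline_sort`, and `evolved_sort`, the 2x thresholds, and the noise floor are all hypothetical, and the output-quality dimension is omitted for brevity.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Measurement:
    output: Any
    seconds: float
    peak_bytes: int


def measure(program: Callable[[Any], Any], workload: Any) -> Measurement:
    """Run one program on one workload, recording output, time, and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    output = program(workload)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return Measurement(output, seconds, peak)


def differential_oracle(
    baseline: Callable[[Any], Any],
    evolved: Callable[[Any], Any],
    workload: Any,
    slowdown: float = 2.0,      # hypothetical threshold: evolved is >2x slower
    mem_blowup: float = 2.0,    # hypothetical threshold: evolved uses >2x memory
    min_delta_s: float = 0.05,  # hypothetical noise floor for timing differences
) -> List[str]:
    """Return the dimensions in which `evolved` regresses relative to `baseline`."""
    base = measure(baseline, workload)
    new = measure(evolved, workload)
    regressions = []
    if new.output != base.output:
        regressions.append("correctness")
    if (new.seconds > slowdown * base.seconds
            and new.seconds - base.seconds > min_delta_s):
        regressions.append("runtime")
    if new.peak_bytes > mem_blowup * max(base.peak_bytes, 1):
        regressions.append("memory")
    return regressions


# Hypothetical program pair: the "evolved" variant silently drops duplicates.
def baseline_sort(xs):
    return sorted(xs)


def evolved_sort(xs):
    return sorted(set(xs))
```

On a workload without duplicates, such as `[3, 1, 2]`, the two programs agree on output, while on `[3, 1, 3]` the oracle's result includes `"correctness"`. This is the kind of input-dependent regression that a fixed evaluation workload can miss.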