AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems

The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns that AI-evolved programs may perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such hidden weaknesses in AI-evolved system programs. To this end, we develop AIChilles. Taking as input a baseline program $P$ and an AI-evolved program $P'$, AIChilles searches for valid workloads on which $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To handle the diversity of system applications, weakness types, and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.
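To make the idea of a differential oracle over $P$ and $P'$ concrete, the following is a minimal sketch and not the AIChilles implementation. It assumes a toy task (return the largest element of a list), a hypothetical "evolved" variant that was over-tuned to short inputs, and plain random workload generation in place of the parameter extraction, constraint inference, and coverage guidance described in the abstract. All names (`baseline_p`, `evolved_p`, `is_valid`, `find_regressions`) are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Workload = List[int]


@dataclass
class Regression:
    """One workload on which the evolved program scored worse than the baseline."""
    workload: Workload
    baseline_score: int
    evolved_score: int


def baseline_p(w: Workload) -> int:
    # Hypothetical baseline P: scans the full workload.
    return max(w)


def evolved_p(w: Workload) -> int:
    # Hypothetical evolved P': only inspects the first 8 items, which looks
    # fine on small training workloads but loses quality on larger ones.
    return max(w[:8])


def is_valid(w: Workload) -> bool:
    # Assumed validity constraint: a workload must be non-empty.
    return len(w) >= 1


def find_regressions(
    p: Callable[[Workload], int],
    p_prime: Callable[[Workload], int],
    trials: int = 200,
    seed: int = 0,
) -> List[Regression]:
    """Differential oracle: report valid workloads where P' scores below P."""
    rng = random.Random(seed)
    found: List[Regression] = []
    for _ in range(trials):
        size = rng.randint(1, 32)
        w = [rng.randint(0, 1000) for _ in range(size)]
        if not is_valid(w):
            continue
        b, e = p(w), p_prime(w)
        if e < b:  # higher is better in this toy quality metric
            found.append(Regression(w, b, e))
    return found


regressions = find_regressions(baseline_p, evolved_p)
```

In this toy setting every reported workload is longer than 8 items, which is the kind of scalability-dependent weakness that an evaluation restricted to small inputs would miss. A real oracle would also compare runtime and memory usage, which are omitted here to keep the sketch deterministic.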

