As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating how well auditors can detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases paired with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluate frontier LLMs and LLM-assisted human auditors on ASMR-Bench and find that both struggle to reliably detect sabotage: the best auditor, Gemini 3.1 Pro, achieves an AUROC of 0.77 and a top-1 fix rate of 42%. We also test LLMs as red teamers and find that LLM-generated sabotages are weaker than human-generated ones but still sometimes evade equally capable LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.
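To make the two headline metrics concrete, here is a minimal sketch of how they could be computed from auditor outputs. This is not the paper's evaluation code: the record fields, the score scale, and the reading of "top-1 fix" as the auditor's single most-suspected location matching the true sabotage are all assumptions for illustration.

```python
# Illustrative sketch only; all field names and the data are hypothetical.
from sklearn.metrics import roc_auc_score

# One record per audited codebase: the auditor's suspicion score, whether the
# codebase was actually sabotaged, and (for sabotaged runs) the file the
# auditor flagged first versus the file that was actually modified.
records = [
    {"score": 0.91, "sabotaged": True,  "top1_file": "train.py", "true_file": "train.py"},
    {"score": 0.34, "sabotaged": False, "top1_file": None,       "true_file": None},
    {"score": 0.48, "sabotaged": True,  "top1_file": "eval.py",  "true_file": "data.py"},
    {"score": 0.22, "sabotaged": False, "top1_file": None,       "true_file": None},
]

# AUROC: how well the auditor's suspicion score separates sabotaged runs
# from clean ones across all decision thresholds.
labels = [int(r["sabotaged"]) for r in records]
scores = [r["score"] for r in records]
print("AUROC:", roc_auc_score(labels, scores))

# Top-1 fix rate: among sabotaged codebases, the fraction where the auditor's
# top-ranked suspect coincides with the real sabotage location.
sabotaged = [r for r in records if r["sabotaged"]]
hits = sum(r["top1_file"] == r["true_file"] for r in sabotaged)
print("Top-1 fix rate:", hits / len(sabotaged))
```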