Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon ML research engineering built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace. A top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves the PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it lowers the PaperBench score by 6.41 points and the MLE-Bench Lite result by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
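To make the permission-scoped File-as-Bus idea concrete, the following is a minimal sketch and not the AiScientist implementation. It assumes that each agent is granted read and write access to a set of top-level workspace directories, that agents exchange work only through files, and that the workspace map is a sorted list of relative file paths. The names `Scope`, `Workspace`, `workspace_map`, and `demo`, the agent names, and the directory layout are all hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Scope:
    """Top-level directories an agent may read or write (hypothetical scheme)."""
    readable: frozenset
    writable: frozenset


class Workspace:
    """A shared directory through which agents exchange durable artifacts."""

    def __init__(self, root, scopes):
        self.root = Path(root).resolve()
        self.scopes = scopes  # maps agent name -> Scope

    def _check(self, agent, rel_path, writing):
        path = (self.root / rel_path).resolve()
        # Reject paths that escape the workspace root (for example via "..").
        if path == self.root or not path.is_relative_to(self.root):
            raise PermissionError(f"{rel_path} is outside the workspace")
        top = path.relative_to(self.root).parts[0]
        scope = self.scopes[agent]
        allowed = scope.writable if writing else scope.readable
        if top not in allowed:
            action = "write" if writing else "read"
            raise PermissionError(f"{agent} may not {action} {rel_path}")
        return path

    def write(self, agent, rel_path, text):
        path = self._check(agent, rel_path, writing=True)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    def read(self, agent, rel_path):
        return self._check(agent, rel_path, writing=False).read_text(encoding="utf-8")

    def workspace_map(self):
        """Compact index of artifacts that an orchestrator could consult."""
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )


def demo():
    # Assumed roles: a planner that writes plans and a coder that writes code.
    scopes = {
        "planner": Scope(frozenset({"analysis", "plans"}), frozenset({"plans"})),
        "coder": Scope(frozenset({"plans", "code"}), frozenset({"code"})),
    }
    with tempfile.TemporaryDirectory() as tmp:
        ws = Workspace(tmp, scopes)
        ws.write("planner", "plans/stage1.md", "Implement the baseline model.")
        # The coder re-grounds on the plan file rather than on a chat message.
        plan = ws.read("coder", "plans/stage1.md")
        ws.write("coder", "code/train.py", "# Plan: " + plan)
        try:
            ws.write("coder", "plans/stage1.md", "overwritten")
            denied = False
        except PermissionError:
            denied = True
        return ws.workspace_map(), denied
```

In this sketch the orchestrator would need only the output of `workspace_map()` plus short summaries to steer the stages, while the full state stays in the files themselves, which is one way to read "thin control over thick state".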