Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous-research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem.
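As an illustration only, the two conversion units and the latency statistic reported by the audit could be represented roughly as follows. This is a minimal sketch under stated assumptions, not SIBYL's actual schema: the `ConversionEvent` record, its field names, and the `latency_summary` helper are hypothetical, and the sample events are placeholders rather than the events recovered in the audit.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ConversionEvent:
    """Hypothetical record linking a trial signal to a later action."""
    unit: str              # "trial-to-behavior" or "trial-to-harness-behavior"
    signal_iteration: int  # iteration in which the trial signal was observed
    action_iteration: int  # iteration in which the linked action or update occurred

    @property
    def latency(self) -> int:
        # Latency measured in iterations between signal and resulting action.
        return self.action_iteration - self.signal_iteration


def latency_summary(events):
    """Summarize conversion latency over a list of events."""
    latencies = [e.latency for e in events]
    return {
        "count": len(latencies),
        "median": median(latencies),
        "max": max(latencies),
    }


# Placeholder events for illustration; NOT data from the paper's audit.
events = [
    ConversionEvent("trial-to-behavior", signal_iteration=2, action_iteration=3),
    ConversionEvent("trial-to-behavior", signal_iteration=4, action_iteration=6),
    ConversionEvent("trial-to-harness-behavior", signal_iteration=1, action_iteration=5),
]
summary = latency_summary(events)
print(summary)
```

Recording the signal and the action as separate iteration indices keeps each conversion auditable: a reviewer can check both endpoints against the workspace artifacts instead of trusting a precomputed latency.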
