A deep research agent produces a fluent scientific report in minutes; a careful reader then tries to verify the main claims and discovers the real cost is not reading, but tracing: which sentence is supported by which passage, what was ignored, and where evidence conflicts. We argue that as research generation becomes cheap, auditability becomes the bottleneck, and the dominant risk shifts from isolated factual errors to scientifically styled outputs whose claim-evidence links are weak, missing, or misleading. This perspective proposes claim-level auditability as a first-class design and evaluation target for deep research agents, distills recurring long-horizon failure modes (objective drift, transient constraints, and unverifiable inference), and introduces the Auditable Autonomous Research (AAR) standard, a compact measurement framework that makes auditability testable via provenance coverage, provenance soundness, contradiction transparency, and audit effort. We then argue for semantic provenance with protocolized validation: persistent, queryable provenance graphs that encode claim-evidence relations (including conflicts) and integrate continuous validation during synthesis rather than after publication, with practical instrumentation patterns to support deployment at scale.