FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

Scientific knowledge bases accelerate discovery by curating findings from the primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search for relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations. Existing benchmarks focus on isolated subtasks such as named entity recognition or relation extraction and do not capture this workflow. We present FlyBench, a benchmark for evaluating AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read a corpus of 16,898 full-text papers to produce three kinds of structured annotation: Gene Ontology terms describing function, expression patterns, and historical synonyms that link decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly affect performance, with multi-agent designs outperforming simpler alternatives, whereas scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than to discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
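
To make the task output concrete, the following is a minimal sketch of how one gene's annotations might be represented. The class name `GeneAnnotations`, its field names, and the example values are all assumptions made for illustration; they are not the benchmark's actual schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotations:
    """Hypothetical container for one gene's curated output.

    This is an illustrative structure, not the benchmark's real schema.
    """
    gene_symbol: str
    # Gene Ontology term identifiers describing function.
    go_terms: set[str] = field(default_factory=set)
    # Labels describing where or when the gene is expressed.
    expression_patterns: set[str] = field(default_factory=set)
    # Historical names used for the gene in older literature.
    synonyms: set[str] = field(default_factory=set)

    def total(self) -> int:
        """Count individual annotations across the three types."""
        return (
            len(self.go_terms)
            + len(self.expression_patterns)
            + len(self.synonyms)
        )


# Placeholder values only; these are not real curated annotations.
example = GeneAnnotations(
    gene_symbol="exampleGene",
    go_terms={"GO:0000001"},
    expression_patterns={"example tissue"},
    synonyms={"old-name-1"},
)
print(example.total())
```

Sets are used here on the assumption that each annotation type is an unordered collection of distinct items, which would make comparison against a reference set straightforward.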
