SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

Jiyong Rao, Yicheng Qiu, Jiahui Zhang, Juntao Deng, Shangquan Sun, Fenghua Ling, Hao Chen, Nanqing Dong, Zhangyang Gao, Siqi Sun, Yuqiang Li, Dongzhan Zhou, Guangyu Wang, Lijun Wu, Conghui He, Xuhong Wang, Jing Shao, Xiang Liu, Yu Zhu, Mianxin Liu, Qihao Zheng, Yinghui Zhang, Jiamin Wu, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Bo Zhang, Wanli Ouyang, Runkai Zhao, Chunfeng Song, Lei Bai, Chi Zhang

The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor the structural homogeneity needed for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready data paradigm, explicitly formalizing how scientific data is specified, structured, and composed within a computational workflow. To operationalize this idea, we propose SciDataCopilot, an autonomous agentic framework designed to handle data ingestion, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to 30$\times$ speedup in data preparation.

