The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor the structural homogeneity suitable for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready data paradigm, explicitly formalizing how scientific data is specified, structured, and composed within a computational workflow. To operationalize this idea, we propose SciDataCopilot, an autonomous agentic framework designed to handle data ingestion, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to 30$\times$ speedup in data preparation.