The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far had limited impact in this domain because of a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources, are inherently multimodal, noisy, and weakly labeled, and lack unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.