Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.