Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.
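To make the feature-masking objective mentioned above concrete, the following is a minimal sketch of masked-feature corruption and a reconstruction loss restricted to the hidden entries. It is not the SoilFormer implementation. The function names (`mask_features`, `masked_mse`), the 25% mask rate, the zero placeholder for masked cells, and the synthetic data are all assumptions made for illustration. The sketch only shows how masking can coexist with uneven feature coverage: cells that were never measured are never chosen as reconstruction targets.

```python
import numpy as np

def mask_features(x, observed, mask_rate, rng):
    """Hide a random subset of the observed entries of x.

    x        : (n_samples, n_features) float array; unobserved cells may be NaN
    observed : boolean array of the same shape, True where a value was measured
    Returns the corrupted input and a boolean array marking the hidden entries.
    """
    # Only cells that were actually measured can become reconstruction targets.
    hidden = observed & (rng.random(x.shape) < mask_rate)
    # 0.0 is a stand-in for a learned mask/missing token (assumption).
    x_in = np.where(hidden | ~observed, 0.0, x)
    return x_in, hidden

def masked_mse(pred, target, hidden):
    """Mean squared error computed over the hidden entries only."""
    n = int(hidden.sum())
    if n == 0:
        return 0.0
    return float(((pred - target) ** 2)[hidden].sum() / n)

# Synthetic stand-in for a table with uneven feature coverage (not real data).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
observed = rng.random(x.shape) > 0.3
x[~observed] = np.nan

x_in, hidden = mask_features(x, observed, mask_rate=0.25, rng=rng)

# Placeholder "model": predicts zero everywhere. A real model would map x_in
# to per-feature predictions; only the loss computation is illustrated here.
pred = np.zeros_like(x)
loss = masked_mse(pred, x, hidden)
```

In this sketch the loss ignores both unmasked and unobserved cells, so missing measurements contribute no gradient signal. Categorical, textual, or visual features would need their own encoders and losses, which are outside the scope of this example.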