LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.
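The abstract states only that SoilFormer is pretrained with a self-supervised objective based on feature masking; it does not spell out the procedure. As a rough illustration of the general idea, and not the authors' implementation, the sketch below hides a random subset of the observed entries in a toy numeric table with uneven coverage and scores a stand-in predictor only on the hidden cells. The function names `mask_features` and `masked_mse`, the 50% mask rate, the toy values, and the column-mean "model" are all assumptions made for illustration.

```python
import numpy as np


def mask_features(x, mask_rate, rng):
    """Hide a random subset of the observed (non-NaN) entries of x.

    Returns the corrupted table and a boolean array marking hidden cells.
    Hypothetical helper, not part of any released LUCAS-MEGA API.
    """
    observed = ~np.isnan(x)
    # Only cells that were actually measured can become reconstruction targets.
    hidden = observed & (rng.random(x.shape) < mask_rate)
    corrupted = np.where(hidden, np.nan, x)
    return corrupted, hidden


def masked_mse(pred, target, hidden):
    """Mean squared error computed over the hidden cells only."""
    if not hidden.any():
        return 0.0
    diff = pred[hidden] - target[hidden]
    return float(np.mean(diff ** 2))


# Toy table: 4 samples x 3 numeric features; NaN marks a missing measurement.
x = np.array([
    [6.1, 12.0, np.nan],
    [5.4, np.nan, 0.8],
    [7.2, 30.5, 1.1],
    [np.nan, 18.2, 0.9],
])

rng = np.random.default_rng(0)
corrupted, hidden = mask_features(x, mask_rate=0.5, rng=rng)

# Stand-in for a learned model: predict each column's mean over visible cells,
# falling back to 0.0 for a column with no visible cells.
visible = ~np.isnan(corrupted)
counts = visible.sum(axis=0)
sums = np.where(visible, corrupted, 0.0).sum(axis=0)
col_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
pred = np.broadcast_to(col_means, x.shape)

loss = masked_mse(pred, x, hidden)
print("hidden cells:", int(hidden.sum()), "masked MSE:", loss)
```

Restricting both the masking and the loss to observed cells is one simple way to cope with the uneven feature coverage the abstract mentions: missing measurements are never used as targets, so they contribute nothing to the training signal.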
