LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.
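As an illustration of the feature-masking objective mentioned above, the following is a minimal sketch of how one sample with uneven feature coverage might be corrupted before reconstruction. It is not SoilFormer's actual implementation: the feature names, the mask rate, the `MASK` placeholder, and the `mask_features` helper are all assumptions made for the example.

```python
import math
import random

MASK = None  # hypothetical placeholder marking a hidden feature value


def _is_missing(value):
    """True for NaN, used here to encode a measurement that was never taken."""
    return isinstance(value, float) and math.isnan(value)


def mask_features(row, mask_rate=0.15, rng=None):
    """Hide a random subset of the observed features in one sample.

    `row` maps feature name -> value; missing measurements are NaN.
    Returns (corrupted_row, targets), where `targets` holds the original
    values of the hidden features, i.e. what a model would be asked to
    reconstruct.
    """
    rng = rng or random.Random()
    # Only observed features are candidates: a missing value has no
    # ground truth to reconstruct.
    observed = [k for k, v in row.items() if not _is_missing(v)]
    n_hide = max(1, round(mask_rate * len(observed))) if observed else 0
    hidden = set(rng.sample(observed, n_hide))
    corrupted = {k: (MASK if k in hidden else v) for k, v in row.items()}
    targets = {k: row[k] for k in hidden}
    return corrupted, targets


# Hypothetical sample: numeric and categorical features, one missing value.
sample = {
    "ph": 6.4,
    "organic_carbon": 21.3,
    "clay": float("nan"),
    "land_cover": "cropland",
}
corrupted, targets = mask_features(sample, mask_rate=0.34, rng=random.Random(0))
```

Restricting the mask to observed features means the reconstruction loss is only computed where a true value exists, which is one simple way to cope with the uneven coverage the abstract describes.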

