YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro-Toro, Marcela Charfuelan, Marlon Nuske, Andreas Dengel

Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to a single region or crop type, which hinders the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and covers major crop types such as corn, rapeseed, soybeans, and wheat across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each at a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale, high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate these shifts, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.
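To make the pixel regression framing concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each field is a raster of per-pixel yield labels, that unlabeled pixels are excluded through a boolean mask, and that an ensemble combines member predictions by simple averaging. The function names `masked_rmse` and `ensemble_predict`, the toy 8x8 raster, and the yield range are hypothetical, and the domain-informed part of the Deep Ensemble described above is not modeled here.

```python
import numpy as np


def masked_rmse(pred, target, mask):
    """RMSE computed only over pixels where mask is True (labeled pixels)."""
    diff = (pred - target)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def ensemble_predict(member_preds):
    """Combine per-member yield maps of shape (M, H, W).

    Returns the per-pixel mean (the ensemble prediction) and the per-pixel
    standard deviation across members (a simple measure of disagreement).
    """
    stacked = np.asarray(member_preds, dtype=float)
    return stacked.mean(axis=0), stacked.std(axis=0)


# Toy data (assumption): an 8x8 yield raster with the first row unlabeled.
rng = np.random.default_rng(0)
target = rng.uniform(2.0, 10.0, size=(8, 8))
mask = np.ones((8, 8), dtype=bool)
mask[0, :] = False

# Five hypothetical ensemble members, simulated as noisy copies of the target.
members = [target + rng.normal(0.0, 1.0, size=target.shape) for _ in range(5)]
mean_map, std_map = ensemble_predict(members)

print("ensemble RMSE on labeled pixels:", masked_rmse(mean_map, target, mask))
```

In a real setting the members would be independently trained networks rather than noisy copies of the labels; the masking and averaging steps would stay the same.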

