Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to a single region or crop type, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and covers major crop types such as corn, rapeseed, soybeans, and wheat across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale, high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate these shifts, we explore a domain-informed Deep Ensemble approach that yields significant performance gains. The dataset is available at https://yieldsat.github.io/.
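To make the pixel regression framing and the ensemble idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each ensemble member outputs a per-pixel yield map for one field, that a boolean mask marks the pixels carrying a yield label, and that member outputs are combined by a plain average. The function names `masked_rmse` and `ensemble_predict`, the toy raster size, and all values are hypothetical, and the "domain-informed" component of the approach is not specified in the abstract, so it is not modeled here.

```python
import numpy as np


def masked_rmse(pred, target, mask):
    """RMSE computed only over pixels where mask is True (labeled pixels)."""
    diff = (pred - target)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def ensemble_predict(member_preds):
    """Combine member yield maps of shape (M, H, W) into a mean and a spread map."""
    stacked = np.asarray(member_preds, dtype=float)
    return stacked.mean(axis=0), stacked.std(axis=0)


# Toy field: a 4x4 raster with made-up yield values (assumption, not real data).
rng = np.random.default_rng(0)
target = rng.uniform(2.0, 10.0, size=(4, 4))

# Assume the first row lies outside the field and carries no label.
mask = np.ones((4, 4), dtype=bool)
mask[0, :] = False

# Three hypothetical ensemble members, simulated as target plus noise.
members = [target + rng.normal(0.0, 1.0, size=target.shape) for _ in range(3)]

mean_map, spread_map = ensemble_predict(members)
ensemble_error = masked_rmse(mean_map, target, mask)
member_errors = [masked_rmse(m, target, mask) for m in members]
```

For a plain average, the squared error of the ensemble mean never exceeds the average squared error of the members, which is one reason ensembles are a common baseline when ground truth distributions shift. The per-pixel spread map can serve as a rough uncertainty indicator.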