Accurate crop yield prediction is essential for improving agricultural practices and for keeping crops resilient across varying climates. Integrating weather data from the whole growing season, particularly across crop varieties, is crucial for understanding how well those varieties adapt to climate change. In the MLCAS2021 Crop Yield Prediction Challenge, we used a dataset of 93,028 training records to forecast yields for 10,337 test records, covering 159 locations across 28 U.S. states and Canadian provinces over 13 years (2003-2015). The dataset includes 5,838 distinct genotypes and daily weather data for a 214-day growing season. As one of the winning teams, we developed two novel convolutional neural network (CNN) architectures: the CNN-DNN model, which combines a CNN with fully connected networks, and the CNN-LSTM-DNN model, which adds an LSTM layer for the weather variables. We used the Generalized Ensemble Method (GEM) to determine the optimal weights for combining the models, and the resulting ensemble outperformed the baseline models. On the test data, the GEM model reduced RMSE by 5.55% to 39.88% and MAE by 5.34% to 43.76%, and increased the correlation coefficient by 1.1% to 10.79%, relative to the baselines. We applied the CNN-DNN model to identify the top-performing genotypes for various locations and weather conditions, which supports genotype selection based on weather variables. This data-driven approach is valuable when only a few years of testing are available. In addition, a feature importance analysis based on the change in RMSE highlighted location, maturity group (MG), year, and genotype, along with the weather variables MDNI and AP.
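To make the ensemble-weighting step concrete, the following is a minimal sketch of GEM weights in the classical closed form, which minimizes the squared error of a weighted combination whose weights sum to one. It is an illustration under stated assumptions, not the authors' implementation: the function name `gem_weights` and the array layout are hypothetical, and the paper may have obtained its weights with a different solver or with additional constraints (for example, non-negative weights).

```python
import numpy as np

def gem_weights(preds, y):
    """Closed-form GEM weights (illustrative sketch, not the paper's code).

    preds: array of shape (n_samples, n_models), one column per base model.
    y:     array of shape (n_samples,), the observed yields.
    Returns weights that sum to 1 and minimize the mean squared error of
    preds @ w on the data provided. Weights may be negative in this form.
    """
    errors = preds - y[:, None]              # per-model residuals
    C = errors.T @ errors / len(y)           # error cross-moment matrix
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)             # proportional to C^{-1} 1
    return w / w.sum()                       # normalize so the weights sum to 1
```

In practice the weights would be fitted on held-out validation predictions rather than on the training data, so that the ensemble does not simply reward the base model that overfits the most.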
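The feature importance analysis based on RMSE change can be sketched in a similar spirit. The abstract does not say how each feature was perturbed, so the version below assumes permutation: one feature column is shuffled at a time and the resulting increase in RMSE is recorded. The names `rmse_change_importance` and `predict` are hypothetical, and `predict` stands in for any fitted model's prediction function.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_change_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in RMSE when each column of X is shuffled.

    Assumption: permutation is used as the perturbation; the paper may
    have measured the RMSE change differently.
    """
    rng = np.random.default_rng(seed)
    base = rmse(y, predict(X))               # RMSE with all features intact
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j
            importance[j] += rmse(y, predict(X_perm)) - base
    return importance / n_repeats
```

A larger value means the model relies more heavily on that feature; a value near zero means shuffling the feature barely changes the predictions.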