Data-driven Lake Water Quality Forecasting for Time Series with Missing Data using Machine Learning

Volunteer-led lake monitoring yields irregular, seasonal time series with many gaps arising from ice cover, weather-related access constraints, and occasional human error, which complicates forecasting and early warning of harmful algal blooms. We study Secchi Disk Depth (SDD) forecasting on a data-rich subset of 30 lakes drawn from three decades of in-situ records collected across Maine lakes. Missing values are imputed with Multiple Imputation by Chained Equations (MICE), and performance is evaluated with a normalized Mean Absolute Error (nMAE) so that results are comparable across lakes. Among six candidate models, ridge regression provides the best mean test performance. Using ridge regression, we then quantify the minimal sample size: under a backward, recent-history protocol, the model reaches within 5% of full-history accuracy with approximately 176 training samples per lake on average. We also identify a minimal feature set: a compact four-feature subset matches the thirteen-feature baseline within the same 5% tolerance. Bringing these results together, we introduce a joint feasibility function that identifies the minimal training history and the fewest predictors sufficient to stay within 5% of the complete-history, full-feature baseline. In our study, meeting the 5% accuracy target required about 64 recent samples and just one predictor per lake, highlighting the practicality of targeted monitoring. The joint feasibility strategy thus unifies recent-history length and feature choice under a fixed accuracy target, yielding a simple, efficient rule that lake researchers can use to set sampling effort and measurement priorities.
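
As a rough illustration of the joint feasibility idea, the following Python sketch picks the smallest (history length, feature count) pair whose error stays within a tolerance of a baseline. It is only a sketch under stated assumptions, not the procedure used in the study: the abstract does not define how nMAE is normalized (here MAE is divided by the mean absolute observed value), how "within 5%" is measured (here a relative increase in nMAE of at most 5%), or how ties between candidate pairs are broken (here fewest samples first, then fewest features). The function names `nmae` and `joint_feasible` and the numbers in `toy_grid` are hypothetical and were chosen for illustration only.

```python
from typing import Dict, Optional, Sequence, Tuple


def nmae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error divided by the mean absolute observed value.

    The normalizer is an assumption; the abstract does not specify it.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    scale = sum(abs(t) for t in y_true) / len(y_true)
    if scale == 0:
        raise ValueError("cannot normalize by a zero mean")
    return mae / scale


def joint_feasible(
    errors: Dict[Tuple[int, int], float],
    baseline: float,
    tol: float = 0.05,
) -> Optional[Tuple[int, int]]:
    """Return the smallest feasible (n_samples, n_features) pair, or None.

    A pair is feasible when its error is at most (1 + tol) * baseline.
    Pairs are ordered by sample count first, then feature count (assumed).
    """
    threshold = (1.0 + tol) * baseline
    feasible = [pair for pair, err in errors.items() if err <= threshold]
    return min(feasible) if feasible else None


# Hypothetical nMAE values for one lake, keyed by (n_samples, n_features).
toy_grid = {
    (32, 1): 0.26,
    (64, 1): 0.205,
    (64, 4): 0.20,
    (176, 13): 0.20,
}
toy_baseline = 0.20  # hypothetical full-history, full-feature nMAE

best_pair = joint_feasible(toy_grid, toy_baseline)
print(best_pair)
```

In practice, each entry of such a grid would come from refitting the forecasting model on the most recent `n_samples` observations with a given feature subset and scoring it on held-out data; the ordering used to break ties is a design choice that trades sampling effort against the number of measured variables.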
