Real-time streamflow monitoring networks generate millions of observations annually, yet maintaining data quality across thousands of remote sensors remains labor-intensive. We introduce HydroGEM (Hydrological Generalizable Encoder for Monitoring), a foundation model for continental-scale streamflow quality control. HydroGEM is trained in two stages: self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations to learn hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures both local temporal patterns and long-range dependencies, while hierarchical normalization handles discharge values spanning six orders of magnitude. On held-out synthetic tests comprising 799 stations and 18 expert-validated anomaly types, HydroGEM achieves F1 = 0.792 for detection and a 68.7% reduction in reconstruction error, a 36.3% improvement over existing methods. Zero-shot transfer to 100 Environment and Climate Change Canada stations yields F1 = 0.586, exceeding all baselines and demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns. HydroGEM is designed for human-in-the-loop workflows: its outputs are quality control suggestions that require expert review, not autonomous corrections.
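The abstract reports two headline metrics, a detection F1 score and a percentage reduction in reconstruction error, without giving their exact definitions. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes point-wise binary anomaly labels and mean absolute error against a known clean series; the function names `f1_score` and `error_reduction_pct` and the toy values are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Point-wise F1 for binary anomaly flags (1 = anomalous, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def error_reduction_pct(clean, corrupted, reconstructed):
    """Percent reduction in mean absolute error after reconstruction.

    Assumption: error is measured against the clean series that existed
    before synthetic anomalies were injected.
    """
    n = len(clean)
    err_before = sum(abs(c - x) for c, x in zip(clean, corrupted)) / n
    err_after = sum(abs(c - r) for c, r in zip(clean, reconstructed)) / n
    if err_before == 0:
        return 0.0  # nothing to correct
    return 100.0 * (err_before - err_after) / err_before


# Toy discharge series with one injected spike at index 1 (illustrative only).
clean = [10.0, 10.0, 10.0, 10.0]
corrupted = [10.0, 20.0, 10.0, 10.0]
reconstructed = [10.0, 12.0, 10.0, 10.0]

labels = [0, 1, 1, 0, 1]
flags = [0, 1, 0, 1, 1]

detection_f1 = f1_score(labels, flags)
reduction = error_reduction_pct(clean, corrupted, reconstructed)
```

Under these assumptions, a reduction of 68.7% would mean the reconstructed series removes roughly two thirds of the error introduced by the injected anomalies. The paper may use a different error norm or an event-level rather than point-wise F1.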