Wastewater treatment plants are increasingly recognized as promising candidates for machine learning applications, owing to their societal importance and the large volumes of data they generate. However, their varied designs, operational conditions, and influent characteristics hinder straightforward automation. In this study, we use data from a pilot reactor at the Veas treatment facility in Norway to explore how machine learning can be used to optimize the biological reduction of nitrate ($\mathrm{NO_3^-}$) to molecular nitrogen ($\mathrm{N_2}$), the biogeochemical process known as \textit{denitrification}. Rather than focusing solely on predictive accuracy, our approach prioritizes understanding the foundational requirements for effective data-driven modelling of wastewater treatment. Specifically, we aim to identify which process parameters are most critical, how much data is needed and of what quality, how to structure the data effectively, and what properties the models must have. We find that nonlinear models perform best on the training and validation data sets, indicating that there are nonlinear relationships to be learned, but that linear models transfer better to the unseen test data, which were collected later in time. The water temperature variable has a particularly detrimental effect on the models, owing to a significant change in its distribution between the training and test data. We therefore conclude that multiple years of data are necessary to train robust machine learning models. By addressing these foundational elements, particularly in the context of the climatic variability faced by northern regions, this work lays the groundwork for a more structured and tailored approach to machine learning for wastewater treatment. We publicly share both the data and the code used to produce the results in the paper.
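To illustrate why a test set that comes later in time can expose a shift in the temperature distribution, the following is a minimal sketch, not the authors' code. It uses synthetic data rather than the Veas measurements, and the helper names `chronological_split` and `standardized_shift`, the split fractions, and the seasonal curve are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev


def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train/validation/test without shuffling."""
    n = len(rows)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return rows[:i], rows[i:j], rows[j:]


def standardized_shift(train_values, test_values):
    """Difference in means, in units of the training standard deviation."""
    return (mean(test_values) - mean(train_values)) / stdev(train_values)


# Synthetic stand-in data (NOT the Veas measurements): one temperature
# reading per day for 300 days, following a seasonal sine curve plus noise.
random.seed(0)
temperature = [
    14.0 + 4.0 * math.sin(2 * math.pi * day / 365) + random.gauss(0, 0.3)
    for day in range(300)
]

train, val, test = chronological_split(temperature)
shift = standardized_shift(train, test)
print(f"train mean: {mean(train):.1f}  test mean: {mean(test):.1f}  shift: {shift:.2f} sd")
```

Because less than one full year is covered, the training period sees mostly the warm half of the cycle while the test period falls in the cold half, so the test temperatures lie well outside the range seen during training. With several years of data, each split would contain complete seasonal cycles and the shift would shrink.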