This paper addresses water quality analysis, an important societal challenge. Because water is a key factor in economic and social development, supplying it and ensuring its quality have long been top priorities for public authorities. Water networks are currently monitored and assessed through methods such as offline and online surveys. These surveys, however, reach a limited number of participants and are conducted infrequently because of the labor they require. In this paper, we propose a Natural Language Processing (NLP) framework that automatically collects and analyzes water-related posts from social media to support data-driven decisions. The framework has two components: (i) text classification and (ii) topic modeling. For text classification, we propose a merit-fusion-based framework that combines several Large Language Models (LLMs), using different weight selection and optimization methods to assign a weight to each LLM. For topic modeling, we use the BERTopic library to discover hidden topic patterns in the water-related tweets. We also analyze relevant tweets originating from different regions and countries to explore global, regional, and country-specific water-related issues and concerns. Finally, we collected and manually annotated a large-scale dataset, which is expected to facilitate future research on the topic.
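As a rough illustration of the weighted fusion idea in the text classification component, the sketch below combines class probabilities from several classifiers using one weight per model. It rests on assumptions not stated in the abstract: the weights are taken to be proportional to each model's validation score, which is only one simple stand-in for the weight selection and optimization methods the paper refers to, and the function names, scores, and probabilities are hypothetical values made up for the example.

```python
import numpy as np


def merit_weights(val_scores):
    """Turn per-model validation scores into weights that sum to 1.

    Assumption: a weight proportional to the validation score. The paper's
    actual weight selection and optimization methods are not given here.
    """
    scores = np.asarray(val_scores, dtype=float)
    return scores / scores.sum()


def fuse_predictions(probs, weights):
    """Weighted late fusion of class probabilities.

    probs:   array of shape (n_models, n_samples, n_classes)
    weights: array of shape (n_models,)
    Returns the predicted labels and the fused probabilities.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contract the model axis: result has shape (n_samples, n_classes).
    fused = np.tensordot(weights, probs, axes=1)
    return fused.argmax(axis=1), fused


# Hypothetical outputs of three classifiers on two posts
# (class 0 = not water-related, class 1 = water-related).
probs = [
    [[0.90, 0.10], [0.40, 0.60]],
    [[0.60, 0.40], [0.30, 0.70]],
    [[0.20, 0.80], [0.55, 0.45]],
]
val_scores = [0.90, 0.85, 0.75]  # made-up validation scores

weights = merit_weights(val_scores)
labels, fused = fuse_predictions(probs, weights)
print(labels, fused)
```

In this sketch a model with a higher validation score has more influence on the fused decision. Replacing `merit_weights` with a search or optimization routine over the weights would keep `fuse_predictions` unchanged.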