This paper applies machine learning to better understand water accessibility in Colonias, underserved communities located along the northern side of the United States-Mexico border. We analyzed more than 2,000 such communities using data from the Rural Community Assistance Partnership (RCAP), applying hierarchical clustering and the adaptive affinity propagation algorithm to automatically group Colonias into clusters with different water access conditions. We use the Gower distance so that the algorithms can process datasets containing both categorical and numerical attributes. To explain the clustering results, we further apply decision tree analysis to associate the input data with the derived clusters, identifying and ranking the factors that characterize the water access conditions in each cluster. Our results complement experts' priority rankings of water infrastructure needs and provide a more in-depth view of the water insecurity challenges that the Colonias face. As an automated and reproducible workflow combining a series of tools, the proposed machine learning pipeline offers an operational approach to data-driven analysis of water access inequality. The pipeline can be adapted to other datasets and decision scenarios.
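To illustrate how the Gower distance handles mixed attribute types, the following is a minimal sketch, not the pipeline's actual implementation. It assumes equal attribute weights and no missing values. The function name `gower_matrix`, the attribute layout, and the three example records are hypothetical and are not drawn from the RCAP data.

```python
from typing import List, Sequence


def gower_matrix(rows: Sequence[Sequence], is_categorical: Sequence[bool]) -> List[List[float]]:
    """Pairwise Gower distances for records with mixed attribute types.

    Numeric attributes contribute |a - b| / range; categorical attributes
    contribute 0 on a match and 1 otherwise. The distance is the unweighted
    mean of the per-attribute contributions (an assumption of this sketch).
    """
    n, p = len(rows), len(is_categorical)

    # Range of every numeric column, used to scale differences into [0, 1].
    ranges = []
    for k in range(p):
        if is_categorical[k]:
            ranges.append(None)
        else:
            column = [float(r[k]) for r in rows]
            ranges.append(max(column) - min(column))

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(p):
                if is_categorical[k]:
                    total += 0.0 if rows[i][k] == rows[j][k] else 1.0
                elif ranges[k] > 0:
                    total += abs(float(rows[i][k]) - float(rows[j][k])) / ranges[k]
                # A constant numeric column contributes nothing.
            dist[i][j] = dist[j][i] = total / p
    return dist


# Hypothetical records: (water source, number of households, sewer connection).
example_rows = [
    ("public", 120, "yes"),
    ("public", 100, "yes"),
    ("hauled", 20, "no"),
]
D = gower_matrix(example_rows, is_categorical=[True, False, True])
```

In this toy example the first two records differ only slightly in household count, so their distance is small, while the third record differs from the first on every attribute. A matrix of this kind can then be passed to clustering routines that accept precomputed dissimilarities.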