Random partition distributions are a powerful tool for model-based clustering. In practice, however, they are difficult to apply to functional spatial data, such as population counts observed hourly in each region. High dimensionality tends to yield an excess of clusters, and spatial dependence is hard to represent with a simple random partition distribution (e.g., the Dirichlet process). This paper addresses these issues by extending the generalized Dirichlet process to incorporate pairwise similarity information; we call the resulting model the similarity-based generalized Dirichlet process (SGDP) and provide theoretical justification for the approach. We apply SGDP to hourly population data observed on 500 m meshes in Tokyo and demonstrate its usefulness for functional clustering that takes spatial information into account.
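To make the idea of feeding pairwise similarity into a random partition concrete, the following is a minimal sketch and not the SGDP construction itself, which the abstract does not specify. It assumes a sequential, Chinese-restaurant-style scheme in which an item joins an existing cluster with weight equal to its summed similarity to that cluster's members, or opens a new cluster with weight `alpha`. The names `similarity`, `sample_partition`, `alpha`, and `length_scale`, as well as the Gaussian similarity on 2-D coordinates, are all illustrative assumptions.

```python
import math
import random


def similarity(a, b, length_scale=1.0):
    """Hypothetical pairwise similarity: decays with squared distance between 2-D points."""
    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-d2 / (2.0 * length_scale ** 2))


def sample_partition(points, alpha=1.0, length_scale=1.0, rng=None):
    """Draw cluster labels one item at a time (illustrative scheme, not SGDP).

    Item i joins existing cluster k with weight equal to the sum of its
    similarities to the members of k, or opens a new cluster with weight
    alpha. If every similarity equals 1, the weights become the cluster
    sizes and the scheme reduces to the Chinese restaurant process.
    """
    rng = rng or random.Random()
    labels = []
    for i, p in enumerate(points):
        n_clusters = max(labels) + 1 if labels else 0
        weights = [0.0] * n_clusters
        for j in range(i):
            weights[labels[j]] += similarity(p, points[j], length_scale)
        weights.append(alpha)  # weight for opening a new cluster
        k = rng.choices(range(n_clusters + 1), weights=weights)[0]
        labels.append(k)
    return labels


# Made-up coordinates standing in for mesh centroids: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = sample_partition(points, alpha=0.5, length_scale=0.5, rng=random.Random(0))
print(labels)
```

Under these assumptions, nearby points pull each other into the same cluster while distant points contribute almost no weight, which is one simple way spatial information can shape a partition. A scheme of this sequential kind generally depends on the order in which items are visited, so it should be read only as intuition for similarity-weighted partitions.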