Clustering passengers by their trajectory records is essential for transportation operators. Existing methods, however, struggle with the hierarchical structure of passenger trip information: each passenger makes multiple trips, and each trip carries multi-dimensional information. They also require the number of clusters to be specified accurately in advance, and they ignore spatial semantic graphs such as geographical proximity and functional similarity between locations. In this paper, we propose a novel tensor Dirichlet Process Multinomial Mixture model with graphs, which preserves the hierarchical structure of the multi-dimensional trip information, clusters passengers in a unified one-step manner, and determines the number of clusters automatically. The spatial graphs are used in community detection to link semantic neighbors. We further propose a tensor version of the Collapsed Gibbs Sampling method with a minimum cluster size requirement. A case study on Hong Kong metro passenger data demonstrates how the number of clusters evolves automatically and shows better cluster quality, measured by within-cluster compactness and cross-cluster separation. The code is available at https://github.com/bonaldli/TensorDPMM-G.
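To make concrete how a Dirichlet Process Multinomial Mixture can settle on a number of clusters without being told one, the following is a minimal sketch of collapsed Gibbs sampling for a flat, non-tensor mixture with no spatial graph and no minimum cluster size. It is not the code from the repository above. The function names `log_predictive` and `collapsed_gibbs_dpmm`, the hyperparameters `alpha` and `beta`, and the toy data (each passenger reduced to a bag of station IDs) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def log_predictive(bag, bag_len, word_counts, total, vocab_size, beta):
    """Log Dirichlet-multinomial predictive of a bag of tokens given a cluster's counts."""
    lp = 0.0
    for token, count in bag.items():
        for j in range(count):
            lp += math.log(word_counts.get(token, 0) + beta + j)
    for i in range(bag_len):
        lp -= math.log(total + vocab_size * beta + i)
    return lp


def collapsed_gibbs_dpmm(docs, vocab_size, alpha=1.0, beta=0.1, n_iter=30, seed=0):
    """Return one cluster label per document; the number of clusters is not fixed."""
    rng = random.Random(seed)
    bags = [Counter(d) for d in docs]
    # Hypothetical initialisation: everyone starts in a single cluster.
    assign = [0] * len(docs)
    clusters = {0: {"size": len(docs), "words": Counter(), "total": 0}}
    for bag in bags:
        clusters[0]["words"].update(bag)
        clusters[0]["total"] += sum(bag.values())
    next_id = 1

    for _ in range(n_iter):
        for d, bag in enumerate(bags):
            n_d = sum(bag.values())
            # Remove document d from its current cluster; drop the cluster if empty.
            old = clusters[assign[d]]
            old["size"] -= 1
            old["words"].subtract(bag)
            old["total"] -= n_d
            if old["size"] == 0:
                del clusters[assign[d]]

            # Score every existing cluster (weight: its size) ...
            labels = list(clusters)
            logp = [
                math.log(clusters[k]["size"])
                + log_predictive(bag, n_d, clusters[k]["words"],
                                 clusters[k]["total"], vocab_size, beta)
                for k in labels
            ]
            # ... and one brand-new cluster (weight: alpha).
            labels.append(next_id)
            logp.append(math.log(alpha)
                        + log_predictive(bag, n_d, {}, 0, vocab_size, beta))

            # Sample a label from the normalised weights.
            top = max(logp)
            weights = [math.exp(x - top) for x in logp]
            k = rng.choices(labels, weights=weights)[0]
            if k == next_id:
                clusters[k] = {"size": 0, "words": Counter(), "total": 0}
                next_id += 1
            clusters[k]["size"] += 1
            clusters[k]["words"].update(bag)
            clusters[k]["total"] += n_d
            assign[d] = k
    return assign


# Toy data (made up): six passengers, each a bag of visited station IDs in 0..5.
docs = [[0, 1, 1, 2], [0, 0, 1, 2], [1, 2, 2, 0],
        [3, 4, 4, 5], [3, 3, 5, 4], [4, 5, 5, 3]]
labels = collapsed_gibbs_dpmm(docs, vocab_size=6)
print("clusters found:", len(set(labels)))
```

In this sketch a cluster disappears as soon as its last member leaves, and a new one is opened whenever the sampler draws the extra label, so the cluster count changes from sweep to sweep. A tensor variant would replace the single bag of tokens with one count array per trip dimension.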