Clustering passengers by their travel records is essential for transportation operators. However, existing methods struggle to cluster passengers because trip information is hierarchical: each passenger has multiple trips, and each trip contains multi-dimensional, multi-mode information. Furthermore, existing approaches require the number of clusters to be specified accurately in advance, which is difficult when millions of commuters use the transport system daily. In this paper, we propose a novel Tensor Dirichlet Process Multinomial Mixture model (Tensor-DPMM), which uses a tensor representation to preserve the multi-mode and hierarchical structure of the multi-dimensional trip information and clusters passengers in a unified, one-step manner. The model also determines the number of clusters automatically: the Dirichlet Process sets the probabilities that a passenger is either assigned to an existing cluster or placed in a new cluster, which allows the model to grow the set of clusters dynamically as needed. Finally, existing methods do not consider spatial semantic graphs, such as geographical proximity and functional similarity between locations, which may lead to inaccurate clustering. To this end, we further propose a variant of our model, namely the Tensor-DPMM with Graph. For inference, we propose a tensor Collapsed Gibbs Sampling method with an innovative "disband and relocate" step, which disbands clusters with too few members and relocates those members to the remaining clusters. This prevents the number of clusters from growing uncontrollably. A case study based on Hong Kong metro passenger data demonstrates the automatic process of learning the number of clusters and shows that the learned clusters are better in within-cluster compactness and cross-cluster separateness.
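The two mechanisms described in the abstract, choosing between an existing and a new cluster and the "disband and relocate" step, can be illustrated with a minimal sketch. This is not the paper's sampler: it uses only the Chinese-restaurant-process prior (cluster size versus a concentration parameter) and omits the tensor multinomial likelihood and the graph term, whose exact forms the abstract does not give. The names `crp_assign`, `disband_and_relocate`, `alpha`, and `min_size` are hypothetical and introduced here for illustration only.

```python
import random
from collections import Counter


def crp_assign(cluster_sizes, alpha, rng):
    """Sample a cluster index for one passenger from a CRP-style prior.

    Existing cluster k is chosen with weight cluster_sizes[k]; a new
    cluster is chosen with weight alpha. Returning len(cluster_sizes)
    means "open a new cluster". Assumption: a full Tensor-DPMM sampler
    would multiply each weight by a likelihood term, omitted here.
    """
    weights = list(cluster_sizes) + [alpha]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def disband_and_relocate(assignments, min_size, rng):
    """Disband clusters with fewer than min_size members.

    Members of disbanded clusters are moved to a surviving cluster,
    drawn with probability proportional to that cluster's current size
    (prior only; an assumption made for this sketch).
    """
    sizes = Counter(assignments.values())
    survivors = {c: n for c, n in sizes.items() if n >= min_size}
    if not survivors:
        # Nothing to relocate into, so leave the assignments unchanged.
        return dict(assignments)
    result = dict(assignments)
    for passenger, cluster in assignments.items():
        if cluster not in survivors:
            labels = list(survivors)
            weights = [survivors[c] for c in labels]
            new_cluster = rng.choices(labels, weights=weights, k=1)[0]
            result[passenger] = new_cluster
            survivors[new_cluster] += 1
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy passengers "a".."g"; cluster 1 has a single member.
    toy = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2, "f": 2, "g": 2}
    print(disband_and_relocate(toy, min_size=2, rng=rng))
    print(crp_assign([3, 3], alpha=1.0, rng=rng))
```

In this sketch a larger `alpha` makes new clusters more likely, while `min_size` bounds how small a cluster may remain after the clean-up pass; how the paper sets either quantity is not stated in the abstract.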