Clustering algorithms frequently require the number of clusters to be chosen in advance, yet it is rarely clear how to choose it. To address this problem for sequential data, we present a method for estimating the number of clusters when the data is a trajectory of a Block Markov Chain, that is, a Markov chain whose transition matrix exhibits a block structure. The method forms a matrix that counts the transitions between each pair of states in the trajectory, and transforms this count matrix into a spectral embedding whose dimension is set via singular value thresholding. The number of clusters is then estimated via density-based clustering of this spectral embedding, an approach inspired by the literature on the Stochastic Block Model. By leveraging and augmenting recent results on the spectral concentration of random matrices with Markovian dependence, we show that the method is asymptotically consistent, in spite of the dependencies between the count matrix's entries, and even when the count matrix is sparse. We also present a numerical evaluation of our method and compare it to alternatives.
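To make the pipeline concrete, the following is a minimal sketch of the three steps described above (count matrix, thresholded spectral embedding, density-based clustering) on a simulated toy chain. It is not the authors' implementation: the toy chain, the threshold constant `5`, the neighborhood radius constant `3`, `min_pts`, the particular embedding built from left and right singular vectors, and the small DBSCAN-style helper `dbscan_num_clusters` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Block Markov Chain (all sizes and probabilities are illustrative).
K, n, T = 3, 60, 60_000        # true clusters, states, trajectory length
size = n // K                  # equal-sized clusters of contiguous states
p = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])  # cluster-to-cluster transition matrix

# The cluster sequence is itself a Markov chain; within the chosen
# cluster, the next state is drawn uniformly at random.
clusters = np.empty(T, dtype=int)
clusters[0] = rng.integers(K)
for t in range(1, T):
    clusters[t] = rng.choice(K, p=p[clusters[t - 1]])
traj = clusters * size + rng.integers(0, size, size=T)

# Step 1: count matrix, N[x, y] = number of observed transitions x -> y.
N = np.zeros((n, n))
np.add.at(N, (traj[:-1], traj[1:]), 1)

# Step 2: spectral embedding; the dimension r is the number of singular
# values above a threshold (the constant 5 is an assumed tuning choice).
scale = np.sqrt((T - 1) / n)
U, s, Vt = np.linalg.svd(N)
r = int(np.sum(s > 5 * scale))
X = np.hstack([U[:, :r] * s[:r], Vt[:r].T * s[:r]])  # one row per state


def dbscan_num_clusters(X, eps, min_pts):
    """Plain DBSCAN; returns the number of clusters and the labels."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dist <= eps
    core = neighbors.sum(axis=1) >= min_pts
    labels = -np.ones(len(X), dtype=int)   # -1 marks noise / unvisited
    k = 0
    for i in np.flatnonzero(core):
        if labels[i] != -1:
            continue
        labels[i] = k
        stack = [i]                        # only core points are expanded
        while stack:
            j = stack.pop()
            for m in np.flatnonzero(neighbors[j]):
                if labels[m] == -1:
                    labels[m] = k
                    if core[m]:
                        stack.append(m)
        k += 1
    return k, labels


# Step 3: estimate the number of clusters by density-based clustering
# (radius constant 3 and min_pts=5 are assumed tuning choices).
k_hat, labels = dbscan_num_clusters(X, eps=3 * scale, min_pts=5)
print("embedding dimension:", r, "estimated number of clusters:", k_hat)
```

In this toy setting the clusters are well separated and the trajectory is long relative to the number of states, so the constants matter little; in sparser regimes the choice of threshold and radius is exactly what the consistency analysis has to control.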