We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. We derive an instance-dependent, high-probability lower bound on the clustering error rate, governed by the stationary-weighted KL divergence between transition kernels. We then propose a two-stage algorithm: Stage I applies spectral clustering via a new injective Euclidean embedding for ergodic Markov chains, a contribution of independent interest that enables sharp concentration results; Stage II refines the clusters with a single likelihood-based reassignment step. We prove that our algorithm achieves near-optimal clustering error with high probability under reasonable requirements on $T$ and $H$. Preliminary experiments support our approach, and we conclude with a discussion of its limitations and extensions.
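The two-stage structure described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the injective embedding, so the sketch assumes a simple stand-in (flattened empirical transition-pair frequencies), a rank-$K$ SVD projection followed by a basic k-means for the spectral step, and add-one smoothing in the likelihood step. All function names (`transition_counts`, `stage1_spectral`, `stage2_reassign`, `demo`) and the two example chains are hypothetical.

```python
import itertools
import numpy as np

def simulate(P, H, rng):
    """Sample a length-H trajectory from transition matrix P (uniform start)."""
    S = P.shape[0]
    x = np.empty(H, dtype=int)
    x[0] = rng.integers(S)
    for t in range(1, H):
        x[t] = rng.choice(S, p=P[x[t - 1]])
    return x

def transition_counts(x, S):
    """S x S matrix of observed transition counts along one trajectory."""
    N = np.zeros((S, S))
    np.add.at(N, (x[:-1], x[1:]), 1.0)
    return N

def kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    return labels

def stage1_spectral(counts, K):
    """Stage I (assumed embedding): pair frequencies -> rank-K SVD -> k-means."""
    X = np.array([N.ravel() / N.sum() for N in counts])
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return kmeans(U[:, :K] * s[:K], K)

def stage2_reassign(counts, labels, K, alpha=1.0):
    """Stage II: one likelihood-based reassignment using per-cluster kernels."""
    S = counts[0].shape[0]
    loglik = np.zeros((len(counts), K))
    for k in range(K):
        # Pool counts of cluster k; alpha smoothing keeps every entry positive.
        pooled = alpha + sum((N for N, l in zip(counts, labels) if l == k),
                             np.zeros((S, S)))
        P_hat = pooled / pooled.sum(axis=1, keepdims=True)
        loglik[:, k] = [np.sum(N * np.log(P_hat)) for N in counts]
    return loglik.argmax(1)

def clustering_error(labels, truth, K):
    """Fraction of misclassified trajectories, minimized over label permutations."""
    return min(np.mean(np.array(p)[labels] != truth)
               for p in itertools.permutations(range(K)))

def demo(seed=0, T=40, H=200):
    rng = np.random.default_rng(seed)
    # Two illustrative ergodic chains on S = 3 states (assumed, not from the paper).
    P = [np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]),
         np.array([[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]])]
    K, S = len(P), 3
    truth = np.arange(T) % K
    counts = [transition_counts(simulate(P[k], H, rng), S) for k in truth]
    labels = stage2_reassign(counts, stage1_spectral(counts, K), K)
    return clustering_error(labels, truth, K), labels
```

Calling `demo()` returns the clustering error (up to a permutation of cluster labels) and the final labels for the synthetic example. The single reassignment pass in `stage2_reassign` mirrors the "single likelihood-based reassignment step" in the text; the embedding in `stage1_spectral` is only a placeholder for the paper's construction.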