Learning Mixtures of Markov Chains with Quality Guarantees

A large number of modern applications, ranging from listening to songs online and browsing the Web to using a navigation app on a smartphone, generate a plethora of user trails. Clustering such trails into groups with a common sequence pattern can reveal significant structure in human behavior, which can lead to improved user experience through better recommendations and can even help prevent suicides [LMCR14]. One approach to modeling this problem mathematically is as a mixture of Markov chains. Recently, Gupta, Kumar and Vassilvitskii [GKV16] introduced an algorithm (GKV-SVD) based on the singular value decomposition (SVD) that, under certain conditions, can perfectly recover a mixture of L chains on n states given only the distribution of trails of length 3 (3-trails). In this work we contribute to the problem of unmixing Markov chains by highlighting and addressing two important limitations of the GKV-SVD algorithm [GKV16]: first, some chains in the mixture may not even be weakly connected; second, in practice one does not know the true number of chains beforehand. We resolve both issues. Specifically, we propose an algebraic criterion that enables us to efficiently choose a value of L that avoids overfitting. Furthermore, we design a reconstruction algorithm that outputs the true mixture in the presence of disconnected chains and is robust to noise. We complement our theoretical results with experiments on both synthetic and real data, where we observe that our method outperforms the GKV-SVD algorithm. Finally, we empirically observe that combining an EM algorithm with our method performs best in practice, in terms of reconstruction error with respect to both the distribution of 3-trails and the mixture of Markov chains.
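To make the input to the unmixing problem concrete, the following is a minimal sketch of how the 3-trail distribution of a mixture could be computed. It assumes a parameterization in which each chain ℓ has a starting vector s^ℓ, with all starting probabilities summing to 1 across chains and states, and a row-stochastic transition matrix M^ℓ, so that p(i, j, k) = Σ_ℓ s^ℓ_i M^ℓ_ij M^ℓ_jk. The function names `random_mixture` and `three_trail_distribution` are hypothetical and chosen for illustration only; this is not the reconstruction algorithm itself.

```python
import random


def random_mixture(num_chains, num_states, seed=0):
    """Draw an illustrative random mixture (an assumption, not real data).

    Returns (starts, transitions), where starts[l][i] is the probability of
    picking chain l and starting in state i, and transitions[l] is a
    row-stochastic matrix for chain l.
    """
    rng = random.Random(seed)
    raw = [[rng.random() for _ in range(num_states)] for _ in range(num_chains)]
    total = sum(sum(row) for row in raw)
    # Normalize jointly so starting probabilities sum to 1 over all chains.
    starts = [[x / total for x in row] for row in raw]

    transitions = []
    for _ in range(num_chains):
        chain = []
        for _ in range(num_states):
            row = [rng.random() for _ in range(num_states)]
            row_sum = sum(row)
            chain.append([x / row_sum for x in row])  # each row sums to 1
        transitions.append(chain)
    return starts, transitions


def three_trail_distribution(starts, transitions):
    """Return p with p[i][j][k] = sum over chains of s[i] * M[i][j] * M[j][k]."""
    n = len(starts[0])
    p = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for s, M in zip(starts, transitions):
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    p[i][j][k] += s[i] * M[i][j] * M[j][k]
    return p


starts, transitions = random_mixture(num_chains=2, num_states=4)
p = three_trail_distribution(starts, transitions)
total_mass = sum(p[i][j][k] for i in range(4) for j in range(4) for k in range(4))
print(total_mass)
```

Under these assumptions the entries of `p` are nonnegative and sum to 1 up to floating-point error, since every transition row and the joint starting distribution are normalized. The unmixing task described above goes in the opposite direction: given only `p`, recover the starting vectors and transition matrices.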
