Sequential data arises naturally from user engagement on digital platforms such as social media, music streaming services, and web navigation, and captures evolving user preferences and behaviors as continuous streams of information. A notable open problem in stochastic processes is learning mixtures of continuous-time Markov chains (CTMCs). While there has been progress on learning mixtures of discrete-time Markov chains with recovery guarantees [GKV16,ST23,KTT2023], the continuous-time setting poses distinct challenges that remain unexplored. Interest in CTMC mixtures stems from their potential to model complex continuous-time stochastic processes that arise in fields including social media, finance, and biology. In this study, we introduce a novel framework for exploring CTMCs, emphasizing how the length of the observed trails and the mixture parameters determine the problem regime, with each regime demanding specific algorithms. Through thorough experimentation, we examine how discretizing continuous-time trails affects the learnability of the continuous-time mixture, since these processes are often observed only through discrete, resource-demanding observations. Our comparison with leading methods examines sample complexity and the trade-off between the number of trails and their lengths, offering guidance for method selection across problem instances. We apply our algorithms to an extensive collection of Lastfm's user-generated trails spanning three years, demonstrating that they can differentiate diverse user preferences. We also pioneer the use of CTMC mixtures on a basketball passing dataset to uncover intricate offensive tactics of NBA teams, which underscores the practical utility and versatility of the proposed framework. All results presented in this study are replicable, and we provide the implementations to facilitate reproducibility.
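To make the notion of discretizing a continuous-time trail concrete, the following is a minimal Python sketch and not the implementation used in this study. It assumes a trail is stored as a list of `(jump_time, state)` pairs and that discretization means recording the occupied state at multiples of a fixed interval `tau`. The names `sample_ctmc_trail`, `discretize`, `rates`, `tau`, and `horizon` are hypothetical and chosen only for illustration.

```python
import bisect
import random


def sample_ctmc_trail(rates, start, horizon, rng):
    """Simulate one CTMC trail on [0, horizon).

    rates[i][j] is the (assumed) transition rate from state i to state j
    for i != j; diagonal entries are ignored.
    Returns a list of (jump_time, state) pairs beginning with (0.0, start).
    """
    t, state = 0.0, start
    trail = [(t, state)]
    while True:
        out = [(j, r) for j, r in enumerate(rates[state]) if j != state and r > 0]
        total = sum(r for _, r in out)
        if total == 0:
            break  # absorbing state: no further jumps
        t += rng.expovariate(total)  # exponential holding time
        if t >= horizon:
            break
        # Next state is chosen with probability proportional to its rate.
        state = rng.choices([j for j, _ in out], weights=[r for _, r in out])[0]
        trail.append((t, state))
    return trail


def discretize(trail, tau, horizon):
    """Record the state occupied at times 0, tau, 2*tau, ... up to horizon."""
    times = [t for t, _ in trail]
    states = [s for _, s in trail]
    observed = []
    for k in range(int(horizon / tau) + 1):
        # Index of the last jump that happened at or before time k * tau.
        idx = bisect.bisect_right(times, k * tau) - 1
        observed.append(states[idx])
    return observed


# Hand-written trail: the short visit to state 1 falls between two
# observation times and disappears after discretization.
example = [(0.0, 0), (0.3, 1), (0.4, 2), (1.2, 0)]
print(discretize(example, tau=0.5, horizon=2.0))  # [0, 2, 2, 0, 0]
```

In this toy example the visit to state 1 between times 0.3 and 0.4 is never observed, which illustrates one way in which a coarse observation interval can discard information about the underlying continuous-time process.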