Markov state modeling has gained popularity in various scientific fields because it reduces complex time series data to transitions between a few states. Yet current frameworks assume that a single Markov chain describes the data, so they cannot discern heterogeneities. As a solution, this paper proposes a variational expectation-maximization (EM) algorithm that identifies a mixture of Markov chains in a time series data set. The method is agnostic to the definition of the Markov states, whether data-driven (e.g., by spectral clustering) or based on domain knowledge. Variational EM efficiently and organically identifies the number of Markov chains and the dynamics of each chain without expensive model comparisons or posterior sampling. The approach is supported by a theoretical analysis and numerical experiments, including simulated and observational data sets based on ${\tt Last.fm}$ music listening, ultramarathon running, and gene expression. The results show that the new algorithm is competitive with contemporary mixture modeling approaches and powerful in identifying meaningful heterogeneities in time series data.
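
To make the idea of a mixture of Markov chains concrete, the following is a minimal sketch of ordinary (non-variational) EM for such a mixture. It is not the paper's algorithm: the number of chains is fixed in advance rather than inferred, the initial-state distribution is ignored, and a small additive smoothing constant stands in for a prior. The function names `transition_counts` and `fit_markov_mixture`, and the parameters `n_chains`, `n_iter`, and `smooth`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def transition_counts(seqs, n_states):
    """Count transitions i -> j within each sequence; result has shape (N, S, S)."""
    counts = np.zeros((len(seqs), n_states, n_states))
    for n, seq in enumerate(seqs):
        for a, b in zip(seq[:-1], seq[1:]):
            counts[n, a, b] += 1
    return counts


def fit_markov_mixture(seqs, n_states, n_chains, n_iter=100, smooth=1e-2, seed=0):
    """Plain EM for a mixture of first-order Markov chains (illustrative only).

    Assumptions: states are integers in [0, n_states), the number of chains is
    fixed, and the initial state of each sequence is not modeled.
    """
    rng = np.random.default_rng(seed)
    counts = transition_counts(seqs, n_states)

    # Random row-stochastic transition matrices, shape (K, S, S).
    trans = rng.dirichlet(np.ones(n_states), size=(n_chains, n_states))
    weights = np.full(n_chains, 1.0 / n_chains)

    for _ in range(n_iter):
        # E-step: log p(sequence n | chain k) = sum_ij counts[n,i,j] * log trans[k,i,j].
        log_lik = np.einsum("nij,kij->nk", counts, np.log(trans))
        log_post = np.log(np.maximum(weights, 1e-300)) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixture weights and responsibility-weighted transition counts.
        weights = resp.mean(axis=0)
        expected = np.einsum("nk,nij->kij", resp, counts) + smooth
        trans = expected / expected.sum(axis=2, keepdims=True)

    return weights, trans, resp
```

Here `resp[n, k]` is the estimated probability that sequence `n` was generated by chain `k`. Like any EM procedure, this sketch can converge to a local optimum, so in practice one would typically try several values of `seed`.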