Hidden Markov models (HMMs) are characterized by an unobservable (hidden) Markov chain and an observable process, which is a noisy version of the hidden chain. Decoding the original signal (i.e., the hidden chain) from the noisy observations is one of the main goals in nearly all HMM-based data analyses. Existing decoding algorithms such as the Viterbi algorithm have computational complexity at best linear in the length of the observed sequence, and sub-quadratic in the size of the state space of the Markov chain. We present Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer procedure that decodes the hidden sequence with computational complexity polylogarithmic in the length of the sequence and cubic in the size of the state space. It is hence particularly suited for large-scale HMMs with relatively few states. The procedure also suggests an effective way to store the data, namely as specific cumulative sums. In essence, the estimated sequence of states sequentially maximizes local likelihood scores among all local paths with at most three segments. The maximization is performed only approximately, using an adaptive search procedure. The resulting sequence is admissible in the sense that all transitions occur with positive probability. To complement formal results justifying our approach, we present Monte Carlo simulations which demonstrate the speedups provided by QATS in comparison to Viterbi, along with a precision analysis of the returned sequences. An implementation of QATS in C++ is provided in the R package QATS and is available from GitHub.
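The abstract does not spell out which cumulative sums are stored, so the following is only a minimal sketch of one plausible reading, not the QATS implementation. It assumes that per-state log emission densities are accumulated along the sequence, so that the log-likelihood score of a segment that stays in a single state can be evaluated in constant time. The names `cumulative_log_emissions`, `segment_score`, `log_emis`, and `log_trans` are hypothetical and chosen for illustration.

```python
import math
from itertools import accumulate


def cumulative_log_emissions(log_emis):
    """Prefix sums of log emission densities (assumed storage layout).

    log_emis[k][t] is the log density of observation t under state k.
    Returns S with S[k][t] = sum(log_emis[k][:t]), so S[k][0] == 0.
    """
    return [[0.0] + list(accumulate(row)) for row in log_emis]


def segment_score(S, log_trans, k, left, right):
    """Log-likelihood score of staying in state k on indices left..right-1.

    Uses two lookups in the prefix sums instead of a loop over the segment.
    The guard avoids evaluating 0 * (-inf) when a state has no self-transition.
    """
    n_obs = right - left
    emission = S[k][right] - S[k][left]
    self_transitions = (n_obs - 1) * log_trans[k][k] if n_obs > 1 else 0.0
    return emission + self_transitions


# Toy example with two states and four observations (made-up numbers).
log_emis = [
    [-1.0, -2.0, -0.5, -3.0],
    [-2.0, -0.5, -1.5, -0.25],
]
log_trans = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
]
S = cumulative_log_emissions(log_emis)
score = segment_score(S, log_trans, k=1, left=1, right=4)
```

Under this assumption, building the prefix sums costs time linear in the sequence length once, after which each segment score is a constant-time query. That kind of precomputation is what would let a divide-and-conquer search over segment boundaries avoid rescanning the data.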