The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC's infinitesimal generator (rate) matrix. Motivated by the derivative's extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a "blessing of dimensionality" result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of SARS-CoV-2 across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.
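As a rough illustration of the quantity being approximated, the sketch below compares the exact directional derivative of the matrix exponential with a first-order surrogate. It rests on several assumptions that are not stated in the text above: the surrogate is taken to be $t\,e^{tQ}E$ (the expression that is exact when $Q$ and the direction $E$ commute, with the ordering chosen here for illustration), the random rate matrix has i.i.d. exponential off-diagonal entries, and the helper names `random_rate_matrix`, `first_order_derivative`, and `exact_derivative` are hypothetical. The exact derivative uses SciPy's `expm_frechet`.

```python
import numpy as np
from scipy.linalg import expm, expm_frechet


def random_rate_matrix(n, rng):
    """Hypothetical generator: i.i.d. exponential off-diagonal rates, rows summing to zero."""
    Q = rng.exponential(size=(n, n))
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q


def first_order_derivative(Q, E, t=1.0):
    """Assumed first-order surrogate t * exp(tQ) @ E; exact only if Q and E commute."""
    return t * expm(t * Q) @ E


def exact_derivative(Q, E, t=1.0):
    """Exact Frechet derivative of A -> exp(A) at A = tQ in direction tE."""
    return expm_frechet(t * Q, t * E, compute_expm=False)


rng = np.random.default_rng(0)
n = 8
Q = random_rate_matrix(n, rng)

# Perturb a single off-diagonal rate (i -> j); the diagonal entry keeps the row sum at zero.
i, j = 0, 1
E = np.zeros((n, n))
E[i, j] = 1.0
E[i, i] = -1.0

approx = first_order_derivative(Q, E)
exact = exact_derivative(Q, E)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative Frobenius error of the first-order surrogate: {rel_err:.3e}")
```

The surrogate costs one matrix exponential and one matrix product per direction, which is the practical appeal when gradients are needed with respect to every off-diagonal rate; the printed error is only a single numerical instance and says nothing about the bounds themselves.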