The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC's infinitesimal generator (rate) matrix. Motivated by the derivative's extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a "blessing of dimensionality" result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of SARS-CoV-2 across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.
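As a brief illustration of what is being approximated, the following sketch writes out the exact directional derivative of the matrix exponential next to a first-order surrogate. The notation is assumed here and does not come from the abstract: $Q$ is the rate matrix, $t$ the elapsed time, and $E$ a perturbation direction (for example, a single-entry matrix selecting one rate). The specific form and ordering $t\,e^{tQ}E$ are one natural reading of "naive, first-order approximation" and should be checked against the paper's own definition.

```latex
% Exact directional (Frechet) derivative of exp(tQ) in direction E
\[
  \nabla_E\, e^{tQ}
  \;=\; \int_0^t e^{sQ}\, E\, e^{(t-s)Q}\,\mathrm{d}s
  \;=\; \sum_{k=1}^{\infty} \frac{t^k}{k!} \sum_{l=0}^{k-1} Q^{l}\, E\, Q^{k-1-l}.
\]
% Assumed first-order surrogate: replace each Q^l E Q^{k-1-l} by Q^{k-1} E,
% i.e. treat Q and E as if they commuted
\[
  \widetilde{\nabla}_E\, e^{tQ}
  \;=\; \sum_{k=1}^{\infty} \frac{t^k}{k!}\, k\, Q^{k-1} E
  \;=\; t\, e^{tQ} E .
\]
```

Under these assumptions the two expressions coincide exactly when $QE = EQ$, so the approximation error is governed by how far $Q$ and $E$ are from commuting. The surrogate needs only one matrix exponential, which can be reused across all directions $E$.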