Inferring the infinitesimal rates of continuous-time Markov chains (CTMCs) is a central challenge in many scientific domains. This task is difficult because the number of rates grows quadratically with the size of the state space, rates can be strongly dependent, and many transitions may be only partially observed. We introduce a Bayesian framework that models CTMC rates as flexible functions of covariates through Gaussian processes. This enables nonlinear covariate effects, improves inference by incorporating external information, and helps identify potential drivers of CTMC dynamics. For posterior inference, we use Hamiltonian Monte Carlo and develop scalable exact and approximate gradients for likelihoods involving repeated matrix exponentials. With $N$ observations and $K$ CTMC states, these gradients reduce the dominant cost of existing derivative calculations from $O(NK^3)$, with large constants, to $O(K^3+NK^2)$, with smaller constants. We demonstrate the method in Bayesian phylogenetic and phylogeographic inference, where CTMCs are central, and show strong performance on synthetic and real datasets, including empirical quadratic scaling in $K$ even when $N<K$.
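To make the stated cost pattern concrete, the sketch below shows one standard way an $O(K^3+NK^2)$ cost can arise when many matrix exponentials share a single rate matrix. It is not the gradient algorithm of this work: it only propagates vectors through $\exp(Qt_n)$, and it assumes a symmetric rate matrix $Q$ so that a real eigendecomposition applies. The function names `build_symmetric_rate_matrix` and `propagate_all` are hypothetical, chosen for illustration.

```python
import numpy as np

def build_symmetric_rate_matrix(K, rng):
    """Random symmetric CTMC generator (illustrative assumption):
    positive off-diagonal rates, each row summing to zero."""
    A = rng.uniform(0.1, 1.0, size=(K, K))
    Q = (A + A.T) / 2.0
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def propagate_all(Q, times, vectors):
    """Return an (N, K) array whose row n is exp(Q * times[n]) @ vectors[n].

    One eigendecomposition costs O(K^3); each observation then costs O(K^2),
    so the total is O(K^3 + N K^2) instead of O(N K^3) for N separate
    dense matrix exponentials.
    """
    eigvals, V = np.linalg.eigh(Q)            # Q = V diag(eigvals) V^T
    coeffs = vectors @ V                      # row n holds V^T v_n
    scaled = np.exp(np.outer(times, eigvals)) * coeffs
    return scaled @ V.T                       # row n holds V (scaled_n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, N = 50, 1000
    Q = build_symmetric_rate_matrix(K, rng)
    times = rng.uniform(0.1, 1.0, size=N)
    vectors = rng.uniform(size=(N, K))
    print(propagate_all(Q, times, vectors).shape)
```

For a general, non-symmetric rate matrix the same idea would need a complex eigendecomposition or another factorization, and differentiating the likelihood with respect to the rates requires additional work that this sketch does not attempt.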