The coalescent is a foundational model of latent genealogical trees under neutral evolution, but it suffers from intractable sampling probabilities. Methods for approximating these sampling probabilities either introduce bias or fail to scale to large sample sizes. We show that a class of cost functionals of the coalescent with recurrent mutation and a finite number of alleles converges to tractable processes in the infinite-sample limit. A particular choice of costs yields insight into importance sampling methods, which are a classical tool for approximating coalescent sampling probabilities. These insights reveal that the behaviour of coalescent importance sampling algorithms differs markedly from that of standard sequential importance samplers, with or without resampling. We conduct a simulation study to verify that our asymptotics are accurate for algorithms with finite (and moderate) sample sizes. Our results constitute the first theoretical description of large-sample importance sampling algorithms for the coalescent, provide heuristics for the a priori optimisation of computational effort, and identify settings where resampling is harmful to algorithm performance. We observe strikingly different behaviour for importance sampling methods under the infinite-sites model of mutation, which is regarded as a good and more tractable approximation of finite-alleles mutation in most respects.