We consider the problem of jointly identifying the ancestral sequence, the phylogeny, and the parameters in models of DNA sequence evolution with insertion and deletion (indel). Under the classical TKF91 model of sequence evolution, we obtain explicit formulas for the root sequence, the pairwise distances between leaf sequences, and the scaled rates of indel and substitution, all expressed in terms of the distribution of the leaf sequences of an arbitrary phylogeny. These explicit formulas not only strengthen existing invertibility results and apply to phylogenies that are not necessarily ultrametric, but also lead to new estimators that require fewer assumptions than those in the existing literature. Our simulation study demonstrates that these estimators are statistically consistent as the number of independent samples tends to infinity.
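To make the notion of statistical consistency concrete, the following is a minimal toy sketch and not one of the estimators derived in this work. It assumes only the standard property of the TKF91 length process that, at stationarity, the sequence length n has the geometric distribution P(n) = (1 - γ) γ^n with γ = λ/μ, where λ and μ are the insertion and deletion rates. The function names, the method-of-moments estimator, and the sample sizes are all illustrative assumptions.

```python
import random


def sample_stationary_length(gamma, rng):
    """Draw a length n >= 0 with P(n) = (1 - gamma) * gamma**n.

    Assumption: gamma = lambda / mu < 1, the stationary geometric
    parameter of the TKF91 length process.
    """
    n = 0
    # Each pass through the loop adds one site with probability gamma.
    while rng.random() < gamma:
        n += 1
    return n


def estimate_gamma(lengths):
    """Toy method-of-moments estimator (hypothetical, for illustration).

    Since E[n] = gamma / (1 - gamma), solving for gamma gives
    gamma = mean / (1 + mean).
    """
    mean = sum(lengths) / len(lengths)
    return mean / (1.0 + mean)


def demo(gamma=0.8, sizes=(100, 10_000, 100_000), seed=0):
    """Return {sample size: estimate} to show estimates settling near gamma."""
    rng = random.Random(seed)
    return {
        k: estimate_gamma([sample_stationary_length(gamma, rng) for _ in range(k)])
        for k in sizes
    }


if __name__ == "__main__":
    for k, g_hat in demo().items():
        print(f"samples={k:>7d}  gamma_hat={g_hat:.4f}")
```

As the number of independent samples grows, the printed estimates should concentrate around the true value of γ, which is the behavior that consistency describes. The estimators discussed above use far more of the leaf-sequence distribution than sequence length alone.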