Credible intervals and credible sets, such as highest posterior density (HPD) intervals, are an integral statistical tool in Bayesian phylogenetics, both for phylogenetic analyses and for method development. While they are readily available for continuous parameters such as base frequencies and clock rates, the vast and complex space of tree topologies poses significant challenges for defining analogous credible sets. Traditional frequency-based approaches are inadequate for diffuse posteriors, where sampled trees are often unique. To address this, we introduce novel and efficient methods for estimating the credible level of individual tree topologies using tractable tree distributions, specifically conditional clade distributions (CCDs). We also propose a new concept called an $α$ credible CCD, which is a CCD whose trees collectively make up probability $α$. We present algorithms to compute these credible CCDs efficiently and to determine the credible levels of tree topologies as well as of subtrees. We evaluate the accuracy of these credible set methods using simulated and real datasets. Furthermore, to demonstrate the utility of our methods, we use well-calibrated simulation studies to evaluate the performance of different CCD models. In particular, we show how the credible set methods can be used to conduct rank-uniformity validation and produce empirical cumulative distribution function (ECDF) plots, supplementing standard coverage analyses for continuous parameters.
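To make the limitation of the traditional frequency-based approach concrete, the following is a minimal Python sketch of how a credible level could be assigned to each sampled topology from raw sample counts. It is not the CCD-based method proposed here. The function name `frequency_credible_levels`, the use of Newick strings as stand-ins for topologies (assumed to be in a canonical form so that identical topologies compare equal), and the convention that tied topologies share one level are all assumptions made for illustration.

```python
from collections import Counter


def frequency_credible_levels(sampled_topologies):
    """Assign each distinct sampled topology a frequency-based credible level.

    The credible level of a topology is taken to be the smallest alpha such
    that the topology lies in the alpha credible set obtained by adding
    topologies in order of decreasing sample frequency. Topologies with equal
    counts are added together and share one level (an illustrative choice).
    """
    counts = Counter(sampled_topologies)
    n = len(sampled_topologies)
    levels = {}
    cumulative = 0
    # Walk through the distinct count values from most to least frequent.
    for c in sorted(set(counts.values()), reverse=True):
        group = [topo for topo, k in counts.items() if k == c]
        cumulative += c * len(group)
        for topo in group:
            levels[topo] = cumulative / n
    return levels


# Hypothetical concentrated posterior sample: levels are informative.
concentrated = ["((A,B),C)"] * 6 + ["((A,C),B)"] * 3 + ["((B,C),A)"] * 1
concentrated_levels = frequency_credible_levels(concentrated)

# Hypothetical diffuse posterior sample: every sampled topology is unique.
diffuse = ["(((A,B),C),D)", "(((A,C),B),D)", "(((A,D),B),C)", "((A,B),(C,D))"]
diffuse_levels = frequency_credible_levels(diffuse)
```

In the concentrated sample, the three topologies receive distinct levels. In the diffuse sample, every topology has a count of one, so all of them land at level 1.0 and any unsampled topology has no level at all, which is the situation that motivates using a tractable tree distribution instead.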