We propose and evaluate two methods that validate the computation of Bayes factors: one based on an improved variant of simulation-based calibration checking (SBC) and one based on calibration metrics for binary predictions. We show that in theory, binary prediction calibration is equivalent to a special case of SBC, but with limited resources, binary prediction calibration is typically more sensitive to the problems we investigated. With well-designed test quantities, SBC can, however, detect all possible problems in computation, including some that cannot be uncovered by binary prediction calibration. Previous work on Bayes factor validation includes checks based on the data-averaged posterior and the Good check method. We demonstrate that both checks miss many problems in Bayes factor computation that are detectable with SBC and binary prediction calibration. Moreover, we find that the Good check as originally described fails to control its error rates. Our proposed checks also typically use simulation results more efficiently than data-averaged posterior checks. Finally, we show that a special approach based on posterior SBC is necessary when checking Bayes factor computation under improper priors, and we validate several models with such priors. We recommend that novel methods for Bayes factor computation be validated with SBC, binary prediction calibration, and data-averaged posterior checks, using at least several hundred simulations. For all the models we tested, the bridgesampling and BayesFactor R packages satisfy all available checks and thus are likely safe to use in standard scenarios.
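To make the idea of checking Bayes factors through binary prediction calibration concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy pair of models chosen only because the Bayes factor is available in closed form: under M0 the data are N(0, 1), and under M1 the data are N(mu, 1) with mu ~ N(0, 1). All function names are hypothetical. The z-statistic used here is a simple calibration-in-the-large check standing in for the calibration metrics mentioned above; it relies on the fact that, if the posterior model probabilities are correct, the true model indicator is Bernoulli with that probability.

```python
import math
import random


def log_bf10(ybar, n):
    """Closed-form log Bayes factor of M1 over M0 for the toy models.

    Only the sample mean matters: ybar ~ N(0, 1/n) under M0 and
    ybar ~ N(0, 1 + 1/n) under M1.
    """
    return -0.5 * math.log(n + 1) + 0.5 * (n * ybar) ** 2 / (n + 1)


def log_bf10_buggy(ybar, n):
    """Deliberately wrong version that drops the normalising term."""
    return 0.5 * (n * ybar) ** 2 / (n + 1)


def posterior_prob_m1(log_bf, prior_m1=0.5):
    """Convert a log Bayes factor into a posterior probability of M1."""
    log_odds = log_bf + math.log(prior_m1 / (1.0 - prior_m1))
    return 1.0 / (1.0 + math.exp(-log_odds))


def calibration_z(log_bf_fn, n_sims=2000, n_obs=10, seed=1):
    """Simulate from the prior over models and return a calibration z-score.

    For each simulation: draw the true model, draw data from it, compute
    the posterior probability p of M1, and accumulate (m - p). Under
    correct computation the standardised sum is approximately N(0, 1).
    """
    rng = random.Random(seed)
    resid_sum = 0.0
    var_sum = 0.0
    for _ in range(n_sims):
        m = 1 if rng.random() < 0.5 else 0           # true model index
        mu = rng.gauss(0.0, 1.0) if m == 1 else 0.0  # parameter draw
        ybar = sum(rng.gauss(mu, 1.0) for _ in range(n_obs)) / n_obs
        p = posterior_prob_m1(log_bf_fn(ybar, n_obs))
        resid_sum += m - p
        var_sum += p * (1.0 - p)
    return resid_sum / math.sqrt(var_sum)


if __name__ == "__main__":
    print("z, correct Bayes factor:", calibration_z(log_bf10))
    print("z, buggy Bayes factor:  ", calibration_z(log_bf10_buggy))
```

With the correct Bayes factor the z-score should stay within a few units of zero, while the buggy version, which systematically favours M1, should produce a z-score far from zero. A check of this kind only probes overall calibration; it would not replace SBC with well-chosen test quantities.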