Evaluating marginal likelihood approximations of dose-response relationship models in Bayesian benchmark dose methods for risk assessment

The benchmark dose (BMD; a dose associated with a specified change in response) is used to determine the point of departure for the acceptable daily intake of substances for humans. The BMD method considers multiple dose-response relationship models. Bayesian model averaging (BMA) is commonly used: several models are averaged according to their posterior probabilities, which are determined from the marginal likelihood (ML) of each model. Because the ML cannot be calculated analytically, standard software packages for the BMD method, such as BBMD, \texttt{ToxicR}, and Bayesian BMD, employ several ML approximation methods. ML values differ among approximation methods, which leads to different posterior probabilities and BMD estimates, yet this phenomenon is neither widely recognized nor quantitatively evaluated. In this study, we evaluated the performance of five ML approximation methods, (1) maximum likelihood estimation (MLE)-based Schwarz criterion, (2) Markov chain Monte Carlo (MCMC)-based Schwarz criterion, (3) Laplace approximation, (4) density estimation, and (5) bridge sampling, through numerical examples using four real experimental datasets. Eight models and three prior distributions used in BBMD and \texttt{ToxicR} were assumed. The approximation and estimation biases of bridge sampling were the smallest regardless of the dataset or prior distribution. Both the approximation and estimation biases of the MCMC-based Schwarz criterion and the Laplace approximation were large for some datasets. The approximation biases of density estimation were relatively small overall but were large for some datasets. In terms of the accuracy of the ML approximation, Bayesian BMD, in which bridge sampling is available, is preferred.
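
To illustrate why differences in approximated ML values carry through to the averaged result, the following is a minimal sketch of how posterior model probabilities can be obtained from log marginal likelihoods, and how a Schwarz-criterion value approximates a log ML. It is not the implementation used in BBMD, \texttt{ToxicR}, or Bayesian BMD. The function names, the equal prior model weights, the choice of sample size `n_obs`, and all numerical inputs are assumptions made for illustration only. Averaging point estimates is also a simplification, since BMA may instead be applied to the full posterior distributions of the BMD.

```python
import math


def schwarz_log_ml(max_log_lik, n_params, n_obs):
    """Schwarz-criterion approximation of the log marginal likelihood.

    log p(D | M) is approximated by log L(theta_hat) - (k / 2) * log(n).
    What counts as n for dose-response data is an assumption of the caller.
    """
    return max_log_lik - 0.5 * n_params * math.log(n_obs)


def posterior_model_probs(log_mls, prior_probs=None):
    """Convert log marginal likelihoods into posterior model probabilities."""
    k = len(log_mls)
    if prior_probs is None:
        # Assumption: equal prior weight on every candidate model.
        prior_probs = [1.0 / k] * k
    log_terms = [lm + math.log(p) for lm, p in zip(log_mls, prior_probs)]
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]


def bma_point_estimate(estimates, probs):
    """Weighted average of per-model estimates (a simplification of BMA)."""
    return sum(e * p for e, p in zip(estimates, probs))


# Hypothetical inputs: maximized log-likelihoods and parameter counts for
# three candidate models, plus made-up per-model BMD estimates.
hypothetical_log_mls = [
    schwarz_log_ml(-52.3, 2, 200),
    schwarz_log_ml(-51.8, 3, 200),
    schwarz_log_ml(-53.1, 2, 200),
]
hypothetical_bmds = [12.0, 9.5, 14.2]

probs = posterior_model_probs(hypothetical_log_mls)
bma_bmd = bma_point_estimate(hypothetical_bmds, probs)
```

Because the posterior probabilities depend only on differences between log ML values, an approximation error that varies from model to model shifts the weights and therefore the averaged BMD, whereas an error common to all models cancels out.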
