Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
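As a rough illustration of point (ii), the following sketch compares two hypothetical Gaussian forecasters on a toy test set. The data, the names `model_a` and `model_b`, and the chosen predictive standard deviations are all assumptions made up for this illustration and are not taken from the paper's examples. Model A has smaller point-prediction errors but is overconfident; model B has larger errors but a wider predictive distribution.

```python
import math

def gaussian_log_pdf(y, mean, std):
    """Log density of N(mean, std^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)

def rmse(ys, means):
    """Root mean squared error of the predictive means."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(ys, means)) / len(ys))

def mean_test_log_likelihood(ys, means, std):
    """Average log predictive density over the test points."""
    return sum(gaussian_log_pdf(y, m, std) for y, m in zip(ys, means)) / len(ys)

# Toy test targets (hypothetical values, for illustration only).
y_test = [-1.0, 0.0, 1.0, 2.0]

# Hypothetical model A: predictive means off by 0.5, overconfident (std = 0.1).
model_a = {"means": [y + 0.5 for y in y_test], "std": 0.1}
# Hypothetical model B: predictive means off by 1.0, wider predictive (std = 1.0).
model_b = {"means": [y + 1.0 for y in y_test], "std": 1.0}

rmse_a = rmse(y_test, model_a["means"])
rmse_b = rmse(y_test, model_b["means"])
tll_a = mean_test_log_likelihood(y_test, model_a["means"], model_a["std"])
tll_b = mean_test_log_likelihood(y_test, model_b["means"], model_b["std"])

print(f"Model A: RMSE = {rmse_a:.3f}, mean test log-likelihood = {tll_a:.3f}")
print(f"Model B: RMSE = {rmse_b:.3f}, mean test log-likelihood = {tll_b:.3f}")
```

In this toy setting, model A has the lower RMSE while model B has the higher test log-likelihood, so the two criteria rank the models in opposite order. The disagreement arises because test log-likelihood scores the whole predictive distribution, including its spread, whereas RMSE depends only on the predictive means.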