Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.