Model selection is a central task in statistics, but standard methods are not robust in misspecified settings where the true data-generating process (DGP) is not in the set of candidate models. The key limitation is that existing methods -- including information criteria and Bayesian posteriors -- do not quantify uncertainty about how well each candidate model approximates the true DGP. In this paper, we introduce a novel approach to model selection based on modeling the likelihood values themselves. Specifically, given $K$ candidate models and $n$ observations, we view the $n\times K$ matrix of negative log-likelihood values as a random data matrix and observe that the expectation of each row is equal to the vector of Kullback--Leibler divergences between the $K$ models and the true DGP, up to an additive constant. We use a multivariate normal model to estimate and quantify uncertainty in this expectation, providing calibrated inferences for robust model selection under misspecification. The procedure is easy to compute, interpretable, and comes with theoretical guarantees, including consistency.
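The core computation described above can be sketched in a few lines. The following is a minimal illustration, not the paper's full procedure: it uses two hypothetical candidates (normal and log-normal) fit to data from a gamma DGP that is in neither family, builds the $n\times K$ matrix of negative log-likelihoods, and quantifies uncertainty in the difference of expected values via a simple normal approximation to the column-mean difference (the paper's multivariate normal model over the full matrix is more general). All distributions and sample sizes here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# True DGP: a gamma distribution, deliberately outside both candidate
# families (the misspecified setting the abstract describes).
n = 500
x = rng.gamma(shape=2.0, scale=1.0, size=n)

# Two hypothetical candidate models, fit by maximum likelihood.
mu, sigma = x.mean(), x.std(ddof=0)                         # normal candidate
logmu, logsigma = np.log(x).mean(), np.log(x).std(ddof=0)   # log-normal candidate

# The n x K matrix of negative log-likelihood values (here K = 2).
# Row expectations equal the KL divergences from the true DGP to each
# candidate, up to a shared additive constant (the DGP's entropy term).
L = np.column_stack([
    -stats.norm.logpdf(x, loc=mu, scale=sigma),
    -stats.lognorm.logpdf(x, s=logsigma, scale=np.exp(logmu)),
])

# The shared constant cancels in differences of column means, so the
# mean per-observation NLL difference estimates the KL-divergence gap.
d = L[:, 0] - L[:, 1]
mean_d = d.mean()
se_d = d.std(ddof=1) / np.sqrt(n)
ci = (mean_d - 1.96 * se_d, mean_d + 1.96 * se_d)
print(f"Estimated KL gap (normal - lognormal): {mean_d:.3f}, 95% CI {ci}")
```

If the confidence interval excludes zero, the data give calibrated evidence that one candidate is a strictly better KL approximation to the true DGP; an interval covering zero expresses genuine uncertainty about which model is closer, which is exactly the quantity standard criteria report without uncertainty.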