In many real-world applications, a model provider supplies probabilistic forecasts to downstream decision-makers, who use them to make decisions under diverse payoff objectives. The provider may have access to multiple predictive models, each potentially miscalibrated, and must choose which model to deploy in order to maximize the usefulness of its predictions for downstream decisions. A central challenge arises: how can the provider meaningfully compare two predictors when neither is guaranteed to be well-calibrated, and when the relevant decision tasks may differ across users and contexts? To answer this, our first contribution introduces the notion of the informativeness gap between any two predictors, defined as the maximum normalized payoff advantage one predictor offers over the other across all decision-making tasks. Our framework strictly generalizes several existing notions: it subsumes U-Calibration and Calibration Decision Loss, which compare a miscalibrated predictor to its calibrated counterpart, and it recovers Blackwell informativeness as a special case when both predictors are perfectly calibrated. Our second contribution is a dual characterization of the informativeness gap, which gives rise to a natural informativeness measure that can be viewed as a relaxed variant of the earth mover's distance between two prediction distributions. We show that this measure satisfies natural desiderata: it is complete and sound, and it can be estimated sample-efficiently in the prediction-only access setting. We complement our theory with experiments on LLM-based forecasters on real-world prediction tasks, showing that the informativeness gap offers a more decision-relevant alternative to traditional metrics and provides a principled lens for evaluating how ad hoc calibration post-processing affects downstream decision usefulness.
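To make the definition concrete, the following is a minimal, hypothetical sketch (not the paper's actual estimator) of how a payoff-advantage comparison between two predictors could be computed empirically. It assumes binary outcomes and restricts the family of decision tasks to cost-sensitive act/abstain problems parameterized by a cost `c`; the payoff function, the cost grid, and the clipping at zero are all illustrative simplifications.

```python
import numpy as np

def payoff(preds, outcomes, c):
    """Average payoff of a cost-sensitive decision-maker who acts
    when the predicted probability is at least the cost c.
    Acting yields (outcome - c); abstaining yields 0."""
    act = preds >= c
    return np.mean(np.where(act, outcomes - c, 0.0))

def informativeness_gap(preds_a, preds_b, outcomes, costs=None):
    """Illustrative empirical gap: the maximum payoff advantage of
    predictor A over predictor B across a grid of decision tasks,
    clipped at zero (A is never 'worse than itself')."""
    if costs is None:
        costs = np.linspace(0.01, 0.99, 99)  # grid of cost-sensitive tasks
    gaps = [payoff(preds_a, outcomes, c) - payoff(preds_b, outcomes, c)
            for c in costs]
    return max(max(gaps), 0.0)
```

Under this simplification, an informative predictor shows a strictly positive gap over an uninformative constant predictor, while any predictor has zero gap over itself; the paper's measure generalizes this idea to all decision tasks with normalized payoffs.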