Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture a model's full output distribution, yielding inaccurate estimates of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework for LLMs. Specifically, we derive novel metrics with high-probability guarantees on a model's output distribution. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations show that most, if not all, of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions is an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.
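The gap between deterministic and probabilistic evaluation can be illustrated with a minimal toy sketch (not the paper's framework or metrics): a hypothetical next-token distribution in which the "forgotten" answer holds substantial probability mass but is never the argmax, so greedy decoding reports no leakage while Monte Carlo sampling from the full distribution exposes it. The token names and probabilities are made up for illustration.

```python
import random

# Hypothetical next-token distribution for one prompt after unlearning:
# the "unlearned" answer ("secret") still holds 40% of the probability
# mass, but it is not the most likely token.
dist = {"safe": 0.6, "secret": 0.4}

def greedy(dist):
    # Deterministic point estimate: always return the argmax token.
    return max(dist, key=dist.get)

def sample(dist, rng):
    # Draw one token from the full output distribution.
    r = rng.random()
    cum = 0.0
    for tok, p in dist.items():
        cum += p
        if r < cum:
            return tok
    return tok  # numerical fallback

# Greedy decoding never surfaces the supposedly forgotten answer...
assert greedy(dist) == "safe"

# ...but sampling reveals it leaks roughly 40% of the time.
rng = random.Random(0)
n = 10_000
leaks = sum(sample(dist, rng) == "secret" for _ in range(n))
print(f"greedy leak rate: 0.00, sampled leak rate: {leaks / n:.2f}")
```

A deterministic evaluator scoring only the greedy output would declare unlearning successful here, while even a modest number of samples makes the residual information visible.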