Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimates of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework for LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most, if not all, of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.
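The gap between deterministic and probabilistic evaluation can be illustrated with a toy next-token distribution (all names and probabilities here are hypothetical, purely for illustration; the paper's actual metrics carry formal high-probability guarantees). Greedy decoding only ever reveals the argmax token, so an evaluation based on it reports zero leakage, while sampling from the same distribution exposes the supposedly unlearned answer at its true probability:

```python
import random

# Hypothetical toy distribution over a model's next-token outputs.
# "forgotten_secret" stands in for unlearned information that retains
# substantial probability mass despite not being the argmax.
dist = {"safe_answer": 0.6, "forgotten_secret": 0.4}

def greedy(dist):
    """Deterministic point estimate: always the most likely token."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Draw one token from the full output distribution."""
    r = rng.random()
    acc = 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding

rng = random.Random(0)
n = 10_000
leaks = sum(sample(dist, rng) == "forgotten_secret" for _ in range(n))
leak_rate = leaks / n

# Deterministic evaluation: the greedy output never shows the secret,
# so unlearning looks successful.
print(greedy(dist))        # "safe_answer"
# Probabilistic evaluation: sampling reveals the secret ~40% of the time.
print(round(leak_rate, 2))
```

The same effect drives the paper's case study: a model can be tuned so the unlearned answer drops out of the greedy path while most of its probability mass survives in the output distribution.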