Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimates of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework for LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most, if not all, of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.
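The gap between deterministic and probabilistic evaluation can be illustrated with a toy next-token distribution (all names and probabilities here are hypothetical, purely for illustration; the paper's actual metrics carry formal high-probability guarantees). Greedy decoding only ever reveals the argmax token, so an evaluation based on it reports zero leakage, while sampling from the same distribution exposes the supposedly unlearned answer at its true probability:

```python
import random

# Hypothetical toy distribution over a model's next-token outputs.
# "forgotten_secret" stands in for unlearned information that retains
# substantial probability mass despite not being the argmax.
dist = {"safe_answer": 0.6, "forgotten_secret": 0.4}

def greedy(dist):
    """Deterministic point estimate: always the most likely token."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Draw one token from the full output distribution."""
    r = rng.random()
    acc = 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding

rng = random.Random(0)
n = 10_000
leaks = sum(sample(dist, rng) == "forgotten_secret" for _ in range(n))
leak_rate = leaks / n

# Deterministic evaluation: the greedy output never shows the secret,
# so unlearning looks successful.
print(greedy(dist))        # "safe_answer"
# Probabilistic evaluation: sampling reveals the secret ~40% of the time.
print(round(leak_rate, 2))
```

The same effect drives the paper's case study: a model can be tuned so the unlearned answer drops out of the greedy path while most of its probability mass survives in the output distribution.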