LitCab: Lightweight Calibration of Language Models on Outputs of Varied Lengths

A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of LMs, as well as in building more trustworthy models. Yet popular neural model calibration techniques are not well-suited for LMs, due to their lack of flexibility in discerning answer correctness and their high computational costs. For instance, post-processing methods like temperature scaling are often unable to reorder the candidate generations. Moreover, training-based methods require finetuning the entire model, which is impractical given the increasing sizes of modern LMs. In this paper, we present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and manipulates the LM output logits. LitCab improves model calibration while adding fewer than 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of 7 text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by 20%. We further conduct a comprehensive evaluation with 7 popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (1) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily on tasks with longer ones. (2) GPT-family models show superior calibration compared to LLaMA, Llama2 and Vicuna models despite having far fewer parameters. (3) Finetuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of finetuning setups for calibrating LMs.
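To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It shows one common way to compute expected calibration error (ECE) with equal-width confidence bins, and it checks that temperature scaling changes confidence values without changing the ranking of candidates. The function names, the choice of 10 bins, and the toy numbers are assumptions made for illustration; the abstract does not specify how ECE is binned.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (an assumed, common formulation).

    Returns the sample-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ranking(scores):
    """Indices of the candidates, sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy data: four answers, each with confidence 0.9, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))

    # Temperature scaling rescales probabilities but keeps the ordering.
    toy_logits = [2.0, 0.5, 1.0]
    print(ranking(softmax(toy_logits, temperature=1.0)))
    print(ranking(softmax(toy_logits, temperature=2.0)))
```

Because dividing all logits by a positive temperature is a monotonic transformation, the candidate ordering is identical at every temperature, which is why such post-processing alone cannot promote a correct answer above an incorrect one.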

