LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating LM hallucinations and in building more trustworthy models. However, standard calibration techniques may not be suited to LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. Training-based methods, on the other hand, require fine-tuning the entire model, which is impractical for large-scale LMs. We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LitCab improves model calibration while adding less than 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on short generation tasks, but not necessarily on longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups for calibrating LMs.
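The two technical ideas in the abstract, a predicted bias added to the output logits and the ECE metric, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the choice of a per-token bias vector computed from a single hidden-state vector, the toy weights, and the 10 equal-width bins used for ECE are all assumptions made for illustration; the abstract does not specify these details or how the layer is trained.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def add_predicted_bias(
    logits: Sequence[float],
    hidden: Sequence[float],
    weight: Sequence[Sequence[float]],
    offset: Sequence[float],
) -> List[float]:
    """Apply one linear layer to a text representation and add the result to the logits.

    Assumed shapes (hypothetical): logits and offset have length V,
    hidden has length d, weight is a V x d matrix.
    """
    bias = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, offset)
    ]
    return [l + b for l, b in zip(logits, bias)]


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example: two tokens with equal logits; the predicted bias separates them.
adjusted = add_predicted_bias(
    logits=[1.0, 1.0],
    hidden=[1.0, 0.0],
    weight=[[0.5, 0.0], [-0.5, 0.0]],
    offset=[0.0, 0.0],
)
probs = softmax(adjusted)

# Toy ECE: four predictions with their confidences and correctness labels.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
```

In this toy setup the bias depends on the input representation, so it can shift the relative probabilities of the outputs. Dividing all logits by one temperature cannot do that, because it preserves their order.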

