A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of LMs, as well as in building more trustworthy models. Yet popular neural model calibration techniques are not well-suited for LMs due to their lack of flexibility in discerning answer correctness and their high computational costs. For instance, post-processing methods like temperature scaling are often unable to reorder the candidate generations. Moreover, training-based methods require finetuning the entire model, which is impractical given the increasing sizes of modern LMs. In this paper, we present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and manipulates the LM output logits. LitCab improves model calibration while adding fewer than 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of 7 text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by 20%. We further conduct a comprehensive evaluation with 7 popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (1) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily on tasks with longer ones. (2) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models despite having far fewer parameters. (3) Finetuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of finetuning setups for calibrating LMs.
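The ECE score mentioned above can be illustrated with a minimal sketch. It assumes the common definition of expected calibration error: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the choice of 10 bins, and the toy inputs are assumptions for illustration, not the evaluation setup used for CaT.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE with equal-width bins (binning scheme is an assumption)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Interior bin edges, e.g. 0.1, 0.2, ..., 0.9 for n_bins=10.
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    # Bin index in [0, n_bins - 1] for each prediction.
    bin_ids = np.digitize(confidences, interior_edges)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        accuracy = correct[mask].mean()
        mean_confidence = confidences[mask].mean()
        # Weight the |accuracy - confidence| gap by the fraction of samples in the bin.
        ece += mask.mean() * abs(accuracy - mean_confidence)
    return ece

# Toy example: the model is 95% confident on four answers but only three are correct.
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
```

A perfectly calibrated model would score 0 under this measure; larger values indicate a wider gap between stated confidence and observed accuracy.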