A model is considered well-calibrated when its probability estimates align with the actual likelihood of its output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations and in building more trustworthy models. However, standard calibration techniques may not be well suited to LMs. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. Training-based methods, on the other hand, require fine-tuning the entire model, which is impractical for large-scale LMs. We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LitCab improves model calibration while adding fewer than 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily on tasks with longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups for calibrating LMs.
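The mechanism described above (a single linear layer that maps the input text representation to a bias added to the output logits) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the choice of one bias value per vocabulary entry, and the zero initialization are all assumptions made here. The parameter-count helper assumes a hidden size of 4096, a vocabulary of 32,000 tokens, and roughly 6.74 billion base parameters for Llama2-7B.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def calibration_bias(hidden, weight, bias):
    """Single linear layer: text representation -> one bias per vocabulary entry.

    Shapes (assumed for this sketch): hidden (d,), weight (V, d), bias (V,).
    """
    return hidden @ weight.T + bias


def calibrated_probs(lm_logits, hidden, weight, bias):
    """Add the predicted bias to the frozen LM logits, then normalize."""
    return softmax(lm_logits + calibration_bias(hidden, weight, bias))


def extra_param_fraction(hidden_size, vocab_size, base_params):
    """Parameters of the linear layer relative to the base model."""
    return (hidden_size * vocab_size + vocab_size) / base_params


rng = np.random.default_rng(0)
hidden_size, vocab_size = 8, 5  # toy sizes, for illustration only
hidden = rng.normal(size=(hidden_size,))
lm_logits = rng.normal(size=(vocab_size,))

# With zero weights and bias the layer leaves the LM distribution unchanged.
weight = np.zeros((vocab_size, hidden_size))
bias = np.zeros(vocab_size)
probs = calibrated_probs(lm_logits, hidden, weight, bias)

# A non-uniform bias can change which token ranks first,
# which a single global temperature cannot do.
lowest = int(np.argmin(lm_logits))
boost = np.zeros(vocab_size)
boost[lowest] = 100.0
reordered = calibrated_probs(lm_logits, hidden, weight, boost)

# Assumed Llama2-7B sizes; the base parameter count is approximate.
fraction = extra_param_fraction(4096, 32000, 6.74e9)
```

Under these assumed sizes the added layer stays below the 2% budget mentioned in the abstract. Because the bias differs per token and depends on the input, it can change the relative order of candidates, whereas dividing all logits by one temperature preserves their order.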
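The abstract reports calibration as an ECE (expected calibration error) score without stating how it is computed. A common formulation groups predictions into confidence bins and takes the size-weighted average gap between accuracy and mean confidence in each bin. The sketch below assumes ten equal-width bins; the function name and the binning scheme are illustrative choices, not details taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only; right=True puts a confidence of 1.0 in the last bin
    # and a confidence of 0.0 in the first bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share of samples
    return ece


# Toy example: four answers at 0.85 confidence (three correct)
# and two answers at 0.65 confidence (one correct).
confidences = [0.85, 0.85, 0.85, 0.85, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

In the toy example the model is overconfident in both bins, so each bin contributes a positive gap. A lower ECE means the stated confidences track the observed accuracy more closely.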