We consider the problem of calibration in large language models (LLMs). Recent studies have found that common interventions, such as instruction tuning, often result in poorly calibrated LLMs. Although calibration is well explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating an LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.
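The abstract does not spell out THERMOMETER's mechanics, but the classical primitive behind auxiliary-model calibration is temperature scaling: dividing a model's logits by a scalar temperature before the softmax, which adjusts confidence without changing the predicted class. A minimal sketch, with an illustrative toy logit vector (not from the paper):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Scale logits by the temperature before normalizing.
    # T > 1 softens (lowers) confidence; T < 1 sharpens it.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: an overconfident logit vector.
logits = np.array([4.0, 1.0, 0.5])
p_raw = softmax(logits)                    # raw, overconfident probabilities
p_cal = softmax(logits, temperature=2.0)   # softened with T = 2

# Accuracy is preserved: argmax is invariant to positive scaling.
assert p_raw.argmax() == p_cal.argmax()
```

This invariance is why temperature-based calibration "preserves the accuracy of the LLM": only the confidence of the response distribution changes, not the top prediction.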