Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, how this impacts confidence calibration for reliable model output has not been fully studied. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown to be an effective method for regularizing overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight into why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large-vocabulary LLMs (LV-LLMs). We posit that the cause stems from the model's capacity to become overconfident, which has a direct relationship with the hidden size and vocabulary size, and we justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label-smoothed setting, designing a customized kernel that dramatically reduces memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
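To make the two technical ideas above concrete, the sketch below shows a label-smoothed cross-entropy computed over vocabulary chunks, so that the full batch-by-vocabulary logit matrix is never materialized at once. This is an illustrative NumPy sketch only: the function name `smoothed_ce_chunked`, the chunk size, and the smoothing value `eps=0.1` are assumptions for demonstration, and the paper's actual solution is a fused GPU kernel, not this Python loop.

```python
import numpy as np

def smoothed_ce_chunked(hidden, W, target, eps=0.1, chunk=1024):
    """Label-smoothed cross-entropy over vocabulary chunks.

    hidden: [B, H] final hidden states; W: [H, V] output projection;
    target: [B] integer labels. Logits are produced chunk-by-chunk,
    so peak memory is O(B * chunk) rather than O(B * V).
    Illustrative sketch; not the paper's fused-kernel implementation.
    """
    B, V = hidden.shape[0], W.shape[1]
    m = np.full(B, -np.inf)   # running max for a streaming logsumexp
    s = np.zeros(B)           # running sum of exp(logit - m)
    tgt_logit = np.zeros(B)   # logit of the true class
    logit_sum = np.zeros(B)   # sum of all logits (for the uniform term)
    for start in range(0, V, chunk):
        z = hidden @ W[:, start:start + chunk]   # [B, <=chunk] logits only
        new_m = np.maximum(m, z.max(axis=1))
        s = s * np.exp(m - new_m) + np.exp(z - new_m[:, None]).sum(axis=1)
        m = new_m
        logit_sum += z.sum(axis=1)
        in_chunk = (target >= start) & (target < start + chunk)
        tgt_logit[in_chunk] = z[in_chunk, target[in_chunk] - start]
    lse = m + np.log(s)                 # log-sum-exp of all logits
    nll = lse - tgt_logit               # -log p(target): the one-hot term
    uniform = lse - logit_sum / V       # mean over classes of -log p
    # smoothed target mixes (1 - eps) one-hot with eps/V uniform mass
    return ((1 - eps) * nll + eps * uniform).mean()
```

With `eps=0` this reduces to the standard cross-entropy; with `eps>0` the uniform term keeps the loss bounded away from zero on overconfident predictions, which is the regularizing effect the abstract refers to.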