Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, re-evaluates the PoLLM's responses by feeding them into the base LLM and uses the resulting average token probability as confidence. While effective, this approach incurs additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of the PoLLM back to those of its base LLM. These projected states are then passed through the base LLM's output layer to derive base-calibrated confidence for the PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that requires neither human labels nor any modification of the LLMs. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90\% compared to the best unsupervised baselines.
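The core computation of BaseCal-ReEval described above can be sketched in a few lines: given the base LLM's next-token logits over the PoLLM's response, the confidence is the mean probability the base model assigns to the tokens actually generated. This is a minimal sketch assuming the base model's logits are already available and aligned with the response; the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def basecal_reeval_confidence(base_logits, response_ids):
    """Average probability the base LLM assigns to the PoLLM's response.

    base_logits:  (T, V) next-token logits from the base LLM, aligned so
                  that base_logits[t] predicts response_ids[t].
    response_ids: (T,) token ids of the PoLLM's generated response.
    """
    probs = softmax(base_logits)                                  # (T, V)
    token_probs = probs[np.arange(len(response_ids)), response_ids]
    return float(token_probs.mean())

# Toy example with a 3-token vocabulary and a 2-token response:
# peaked logits on the response tokens -> confidence near 1.
peaked = np.array([[10.0, 0.0, 0.0],
                   [0.0, 10.0, 0.0]])
print(basecal_reeval_confidence(peaked, np.array([0, 1])))

# Uniform logits -> confidence equals 1/V = 1/3, i.e. the base model
# is maximally uncertain about the response.
uniform = np.zeros((2, 3))
print(basecal_reeval_confidence(uniform, np.array([0, 1])))
```

In practice the same quantity can be obtained from any LLM framework that exposes per-token log-probabilities of a supplied continuation; BaseCal-Proj avoids this second forward pass by predicting the base model's final-layer states directly from the PoLLM's states.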