How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

Recent work has shown that language models (LMs) capture different types of knowledge about facts and common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property that a probabilistic model's predicted probabilities are well correlated with the actual probabilities of correctness. We examine three strong generative models (T5, BART, and GPT-2) and study whether their probabilities on QA tasks are well calibrated, finding that the answer is a relatively emphatic no. We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness, through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also analyze the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.
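To make the notion of calibration concrete, the sketch below computes a binned expected calibration error (ECE) and applies a temperature to candidate-answer scores as one simple form of post-hoc probability modification. It is an illustration under stated assumptions, not the released implementation: the function names, the choice of ten equal-width bins, and the use of temperature scaling are assumptions introduced here and are not specified in the abstract.

```python
import math


def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted mean of |accuracy - average confidence| per bin.

    confidences: predicted probabilities in [0, 1], one per question.
    correct: booleans, True where the predicted answer was right.
    num_bins: number of equal-width confidence bins (an assumption here).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def softmax_with_temperature(scores, temperature=1.0):
    """Turn candidate-answer scores into probabilities.

    A temperature above 1 flattens the distribution (lower confidence);
    a temperature below 1 sharpens it. Hypothetical helper for illustration.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    # Four answers given at 0.9 confidence, only three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
    # The same candidate scores at two temperatures.
    print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0))
    print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0))
```

A lower ECE means the stated confidences track accuracy more closely. In a setup like this, a temperature would typically be chosen on held-out data to minimize a calibration or likelihood objective; that fitting step is omitted here.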
