Language models (LMs) are statistical models trained to assign probability to human-generated text. As such, it is reasonable to ask how well they approximate the linguistic variability exhibited by humans. This form of statistical assessment is difficult to perform at the passage level, because it requires acceptability judgements (i.e., human evaluation) or a robust automated proxy (which is non-trivial to build). At the word level, however, given some context, samples from an LM can be assessed via exact matching against a prerecorded dataset of alternative single-word continuations of that context. We exploit this fact and evaluate the LM's ability to reproduce the variability that humans (in particular, a population of English speakers) exhibit in the 'next word prediction' task. This can be seen as assessing a form of calibration, which, in the context of text classification, Baan et al. (2022) termed calibration to human uncertainty. We assess GPT2, BLOOM and ChatGPT and find that they exhibit fairly low calibration to human uncertainty. We also verify that expected calibration error (ECE) fails to reflect this and therefore advise the community against relying on it in this setting.
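As an illustration of the word-level comparison described in the abstract, the sketch below matches single-word samples from a model against single-word human continuations for the same context and summarizes the disagreement between the two empirical distributions. The choice of total variation distance as the summary statistic, the lowercasing and whitespace normalization, the function names, and the toy data are all assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter


def empirical_distribution(words):
    """Map each distinct word to its relative frequency in `words`."""
    if not words:
        raise ValueError("need at least one word")
    # Assumed normalization so that exact matching ignores case and padding.
    counts = Counter(w.strip().lower() for w in words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over words."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)


# Hypothetical data: next-word responses for one shared context.
human_words = ["dog", "dog", "cat", "bird"]  # one word per human annotator
model_words = ["dog", "dog", "dog", "cat"]   # one word per LM sample

human_dist = empirical_distribution(human_words)
model_dist = empirical_distribution(model_words)
tvd = total_variation_distance(human_dist, model_dist)
print(f"TVD between human and model next-word distributions: {tvd:.2f}")
```

A distance of 0 would mean the model's samples reproduce the observed human variability for this context exactly, and a distance of 1 would mean the two sets of words do not overlap at all. In practice such a score would be computed per context and then aggregated over a dataset.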