Large language models (LLMs) that give inconsistent answers across contexts are problematic for tasks that demand consistency, such as question answering and explanation. We present an evaluation benchmark for self-consistency in under-specified cases, where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67\% to 82\%, far higher than would be predicted if a model's consistency were random, and that it increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including changes to the speaker in the prompt and changes to the sequence length. These results suggest that self-consistency arises as an emergent capability without specific training for it. Despite this, we find that models are uncalibrated when judging their own consistency, displaying both over- and under-confidence. We also propose a nonparametric test for determining, from the output token distribution, whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
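To make the notion of an ambiguous integer sequence concrete, the following is a minimal sketch and not the benchmark's actual code. It assumes a small hypothetical rule set (`RULES`), a prefix that more than one rule can generate, and a list of hand-written (stated rule, completion) pairs standing in for model responses. A response counts as self-consistent here when the completion equals the next value under the rule the model itself stated; the real benchmark may define and measure consistency differently.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate rules, indexed from n = 0. They are illustrative
# assumptions and are not taken from the benchmark.
RULES: Dict[str, Callable[[int], int]] = {
    "2 ** n": lambda n: 2 ** n,
    "n * (n + 1) // 2 + 1": lambda n: n * (n + 1) // 2 + 1,
    "3 * n + 1": lambda n: 3 * n + 1,
}


def matching_rules(prefix: List[int]) -> List[str]:
    """Return the names of all rules that reproduce every term of the prefix."""
    return [
        name
        for name, rule in RULES.items()
        if all(rule(i) == value for i, value in enumerate(prefix))
    ]


def is_self_consistent(prefix: List[int], stated_rule: str, completion: int) -> bool:
    """Check that the completion is the next term under the stated rule."""
    return RULES[stated_rule](len(prefix)) == completion


def consistency_rate(prefix: List[int], responses: List[Tuple[str, int]]) -> float:
    """Fraction of (stated_rule, completion) pairs that agree with each other."""
    if not responses:
        raise ValueError("responses must not be empty")
    hits = sum(is_self_consistent(prefix, rule, value) for rule, value in responses)
    return hits / len(responses)


# The prefix 1, 2, 4 is under-specified: two of the rules above generate it,
# and they disagree on the next term (8 versus 7).
prefix = [1, 2, 4]
candidates = matching_rules(prefix)

# Hand-written stand-ins for model outputs; the second one is inconsistent
# because it states the doubling rule but continues with 7.
responses = [
    ("2 ** n", 8),
    ("2 ** n", 7),
    ("n * (n + 1) // 2 + 1", 7),
    ("2 ** n", 8),
]
rate = consistency_rate(prefix, responses)
```

In this toy setup both 7 and 8 are correct continuations, so correctness alone cannot separate the responses; only agreement between the stated rule and the completion does.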