Large language models (LLMs) have demonstrated impressive capabilities but still suffer from inconsistency: for example, they can respond differently to perturbations such as rephrasing or an inconsequential change of order. Beyond these inconsistencies, we observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, in which each entry comprises a pair of questions with a strict order of difficulty. We further introduce the consistency score to measure this inconsistency quantitatively, and the relative consistency score to analyze the potential for improvement in consistency. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2\% but is still inconsistent on specific questions, due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
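As an illustration of how a score over paired questions might be computed, the following is a minimal sketch. It assumes, and the abstract does not confirm, that the consistency score is the empirical probability of answering the easy question correctly given that the harder question of the same pair was answered correctly. The names `PairResult` and `consistency_score` are hypothetical and are not taken from the ConsisEval code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PairResult:
    """Outcome for one benchmark entry (hypothetical structure)."""
    easy_correct: bool  # model answered the easy question correctly
    hard_correct: bool  # model answered the hard question correctly


def consistency_score(results: Sequence[PairResult]) -> Optional[float]:
    """Estimate P(easy correct | hard correct) over the evaluated pairs.

    Assumed definition: only pairs whose hard question was solved count,
    since those are the cases where failing the easy one is inconsistent.
    Returns None when no hard question was solved (score undefined).
    """
    hard_solved = [r for r in results if r.hard_correct]
    if not hard_solved:
        return None
    return sum(r.easy_correct for r in hard_solved) / len(hard_solved)


results = [
    PairResult(easy_correct=True, hard_correct=True),
    PairResult(easy_correct=False, hard_correct=True),  # hard-to-easy failure
    PairResult(easy_correct=True, hard_correct=False),  # ignored by the score
    PairResult(easy_correct=True, hard_correct=True),
]
score = consistency_score(results)
```

Under this assumed definition, pairs whose hard question was missed do not affect the score, so a weak model is not penalized for inconsistency it had no opportunity to show.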