Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Large language models (LLMs) have demonstrated impressive capabilities, but they still suffer from inconsistency issues (e.g., LLMs can react differently to disturbances such as rephrasing or inconsequential changes in order). In addition to these inconsistencies, we observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of a consistency score to quantitatively measure this inconsistency, and we use a relative consistency score to analyze the potential for improving consistency. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent on specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
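The abstract does not spell out how the consistency score is computed. One plausible reading, sketched here purely as an assumption, is the empirical probability that a model answers the easy question correctly given that it answered the paired hard question correctly. The function name `consistency_score`, the `(easy_correct, hard_correct)` tuple layout, and the toy data are all hypothetical and are not taken from the ConsisEval code.

```python
from typing import Iterable, Tuple


def consistency_score(results: Iterable[Tuple[bool, bool]]) -> float:
    """Estimate P(easy correct | hard correct) from paired outcomes.

    Each item is (easy_correct, hard_correct) for one question pair.
    This definition is an assumption made for illustration only.
    """
    hard_solved = 0
    both_solved = 0
    for easy_ok, hard_ok in results:
        if hard_ok:
            hard_solved += 1
            if easy_ok:
                both_solved += 1
    if hard_solved == 0:
        # The conditional probability is undefined without any solved hard question.
        raise ValueError("no pair with a solved hard question; score is undefined")
    return both_solved / hard_solved


# Hypothetical toy outcomes, not real benchmark results.
toy_results = [(True, True), (False, True), (True, False), (True, True), (False, False)]
score = consistency_score(toy_results)
print(f"consistency score: {score:.3f}")
```

Under this reading, pairs where the hard question was missed do not affect the score; only failures on easy questions whose harder counterpart was solved lower it.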

