Large Language Models (LLMs) are often described as instances of foundation models - that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict performance improvement with increasing pre-training scale. These claims of excelling across diverse functions and tasks rely on measurements taken across various sets of standardized benchmarks, on which such models show high scores. We demonstrate here a dramatic breakdown of function and reasoning capabilities in state-of-the-art models trained at the largest available scales and claiming strong performance, using a simple, short, conventional common-sense problem (the AIW problem) formulated in concise natural language and easily solvable by humans. The breakdown is dramatic, as the models show strong fluctuations across even slight problem variations that should not affect problem solving, while also expressing strong overconfidence in their wrong solutions, often backed up by plausible-sounding, explanation-like confabulations. Various standard interventions attempting to obtain the right solution, such as various types of enhanced prompting or urging the models to reconsider their wrong solutions through multi-step re-evaluation, fail. We bring these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that allow proper detection of such basic reasoning deficits, which evidently remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing the experiments in the paper and the raw experimental data can be found at https://github.com/LAION-AI/AIW