Large Language Models (LLMs) are often described as instances of foundation models - that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict performance improvement with increasing pre-training scale. These claims of excelling across diverse functions and tasks rely on measurements taken across various sets of standardized benchmarks, on which such models show high scores. We demonstrate here a dramatic breakdown of function and reasoning capabilities in state-of-the-art models trained at the largest available scales and claiming strong performance, using a simple, short, conventional common-sense problem (the AIW problem) formulated in concise natural language and easily solvable by humans. The breakdown is dramatic, as the models show strong fluctuations across even slight problem variations that should not affect problem solving, while also expressing strong overconfidence in their wrong solutions, often backed up by plausible-sounding, explanation-like confabulations. Various standard interventions attempting to obtain the right solution, such as various types of enhanced prompting or urging the models to reconsider their wrong solutions through multi-step re-evaluation, fail. We bring these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that allow proper detection of such basic reasoning deficits, which evidently remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing the experiments in the paper and the raw experimental data can be found at https://github.com/LAION-AI/AIW