WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

Youssef Benchekroun,Megi Dervishi,Mark Ibrahim,Jean-Baptiste Gaya,Xavier Martinet,Grégoire Mialon,Thomas Scialom,Emmanuel Dupoux,Dieuwke Hupkes,Pascal Vincent

We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. Worldsense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constraint problem space.

翻译：我们提出WorldSense基准测试，旨在评估大型语言模型在多大程度上能够一致地维持隐含的世界模型，通过测试它们如何从简单实体排列的描述中得出简单推理。WorldSense是一个合成基准测试，包含三种问题类型，每种类型都有各自的琐碎对照，通过将问题的抽象结构与词汇和表达式去相关，并将所有问题子部分与正确响应去相关，明确避免了偏差。我们在三个最先进的聊天型大型语言模型（GPT3.5、GPT4和Llama2-chat）上运行了该基准测试，并表明即使仅涉及三个对象，这些模型也会出错。此外，它们具有相当严重的响应偏差，倾向于偏好某些响应而不考虑问题。即使采用链式思维提示和上下文学习，错误仍然存在。最后，我们表明，虽然针对类似问题的微调确实会带来显著改进——在分布内和分布外——但微调模型无法推广到约束问题空间之外。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

生成性对抗网络:理论模型、评估指标和最近发展的概述，Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments

专知会员服务

42+阅读 · 2020年5月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日