We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts from the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have strong response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements, both within- and out-of-distribution, the finetuned models do not generalise beyond a constrained problem space.
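To make the idea of a decorrelated synthetic problem concrete, the following is a minimal sketch under stated assumptions and not the benchmark's actual generator. The vocabulary, the left/right relation, and the names `make_problem`, `reference_answer` and `response_bias` are all hypothetical. It builds a three-object arrangement, describes it through adjacent pairs only, and picks the target label before wording the question, so that neither the relation word nor the answer can be guessed from surface cues.

```python
import random

NAMES = ["apple", "book", "cup", "drum", "egg"]  # hypothetical vocabulary


def make_problem(rng, n_objects=3):
    """Build one true/false question over a hidden left-to-right arrangement."""
    order = rng.sample(NAMES, n_objects)  # hidden ground truth, left to right
    # Describe the arrangement only through adjacent pairs.
    facts = [f"The {a} is left of the {b}." for a, b in zip(order, order[1:])]
    first, last = order[0], order[-1]
    # Pick the label and the relation word independently, then order the two
    # queried objects so that the statement has the chosen truth value.
    label = rng.choice([True, False])
    relation = rng.choice(["left", "right"])
    if (relation == "left") == label:
        x, y = first, last
    else:
        x, y = last, first
    question = f"Is the {x} {relation} of the {y}?"
    return {"order": order, "facts": facts, "question": question,
            "query": (x, relation, y), "label": label}


def reference_answer(problem):
    """Recompute the answer directly from the hidden arrangement."""
    pos = {name: i for i, name in enumerate(problem["order"])}
    x, relation, y = problem["query"]
    return pos[x] < pos[y] if relation == "left" else pos[x] > pos[y]


def response_bias(responses):
    """Share of True responses minus 0.5; zero means no preference."""
    return sum(responses) / len(responses) - 0.5
```

Because the labels are balanced by construction, a respondent that always answers "yes" has maximal response bias but only chance-level accuracy, which is the kind of behaviour a bias-controlled benchmark is meant to expose.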