Despite widespread use of LLMs as conversational agents, evaluations of their performance fail to capture a crucial aspect of communication: interpreting language in context, that is, incorporating its pragmatics. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs can make this type of inference, known as an implicature, we design a simple task and evaluate four categories of widely used state-of-the-art models. We find that, even though we evaluate only on utterances that require a binary inference (yes or no), models in three of these categories perform close to random. However, LLMs instruction-tuned at the example level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models. We present our findings as a starting point for further research into evaluating how LLMs interpret language in context, and to drive the development of more pragmatic and useful models of human discourse.