We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task in which a builder must resolve underspecified instructions through such inference. Building on an existing two-speaker psycholinguistic paradigm, which contrasts a pragmatically cooperative speaker with one who is only literally reliable, we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity either by performing a contextual inference or by requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: although models detect speaker unreliability in explicit confidence ratings, they fail to use this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.
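The trade-off between inferring and asking can be illustrated with a minimal expected-cost sketch. This is not the benchmark's scoring rule: the function name `should_ask`, the error penalty, the question cost, and the probabilities below are all hypothetical values chosen for illustration.

```python
def should_ask(p_inference_correct: float,
               error_penalty: float,
               question_cost: float) -> bool:
    """Return True when asking is cheaper in expectation than guessing.

    All three parameters are illustrative assumptions, not BWIM's actual
    payoffs: the chance the contextual inference is right, the loss for a
    wrong build, and the fixed cost of one clarification question.
    """
    expected_guess_loss = (1.0 - p_inference_correct) * error_penalty
    return question_cost < expected_guess_loss


# Hypothetical numbers: a cooperative speaker makes the inference reliable,
# while a literal-only speaker leaves the builder near chance.
ask_cooperative = should_ask(p_inference_correct=0.95, error_penalty=1.0, question_cost=0.2)
ask_literal_only = should_ask(p_inference_correct=0.5, error_penalty=1.0, question_cost=0.2)
```

Under these assumed numbers, a partner-sensitive builder would guess with the cooperative speaker and ask with the literal-only one. Asking in both cases would correspond to partner-blind over-clarification, and guessing in both to question-averse guessing.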