Recent work studies the cognitive capabilities of language models through psychological tests designed for humans. While these studies help in understanding the general capabilities of these models, there is no guarantee that a model capable of passing those tests would actually use the same capabilities when performing real-life tasks. In this work, we formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks. These capabilities are (i) the ability to quickly generate good candidate utterances (the search capability), and (ii) the ability to predict how a listener interprets those utterances and choose the most appropriate one (the pragmatic capability). We design an evaluation scheme for comparing these capabilities of a language model with those of a human. Applying this scheme to various models on a navigation instruction generation problem, we find that their pragmatic capability is severely lacking. This insight leads us to augment them with better models of the listener, which yields a significant boost of 11% in success rate in guiding real humans. Our work advocates a principled procedure for aligning language models with humans that involves (i) formulating task-oriented capabilities, (ii) devising a method to quantify their deficiency, and (iii) iteratively improving them.
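The two capabilities can be illustrated with a minimal sketch, assuming candidate instructions have already been produced by the search step and that a listener model returns a probability distribution over routes for any instruction. The names `choose_instruction`, `toy_listener`, `route_a`, and `route_b` are hypothetical, and the hand-written listener rules are a stand-in for a learned model, not the method used in this work.

```python
from typing import Callable, Dict, List

# Hypothetical listener type: maps an instruction to a probability
# distribution over the routes a listener might follow.
Listener = Callable[[str], Dict[str, float]]


def choose_instruction(
    candidates: List[str], intended_route: str, listener: Listener
) -> str:
    """Pragmatic step: pick the candidate that the listener is most
    likely to interpret as the intended route."""
    return max(candidates, key=lambda u: listener(u).get(intended_route, 0.0))


def toy_listener(instruction: str) -> Dict[str, float]:
    """Illustrative stand-in for a learned listener model.
    The rules and probabilities below are made up for this example."""
    if "left" in instruction and "kitchen" in instruction:
        return {"route_a": 0.9, "route_b": 0.1}
    if "left" in instruction:
        return {"route_a": 0.5, "route_b": 0.5}
    return {"route_a": 0.2, "route_b": 0.8}


# Search step (assumed done elsewhere): a few candidate instructions.
candidates = [
    "Turn left.",
    "Turn left and stop at the kitchen.",
    "Go straight ahead.",
]

best = choose_instruction(candidates, "route_a", toy_listener)
print(best)
```

In this sketch the quality of the final instruction depends entirely on how well the listener model predicts real interpretations, which is the component the abstract identifies as deficient.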