Large language models (LLMs) excel at processing and generating both text and code. However, LLMs have had limited applicability in grounded task-oriented dialogue, as they are difficult to steer toward task objectives and fail to handle novel grounding. We present a modular and interpretable grounded dialogue system that addresses these shortcomings by composing LLMs with a symbolic planner and grounded code execution. Our system consists of a reader and a planner. The reader leverages an LLM to convert partner utterances into executable code, calling functions that perform grounding. The output of the translated code is stored to track dialogue state, while a symbolic planner determines the next appropriate response. We evaluate our system on the demanding OneCommon dialogue task, which involves collaborative reference resolution on abstract images of scattered dots. Our system substantially outperforms the previous state of the art, including improving task success in human evaluations from 56% to 69% in the most challenging setting.
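The reader and planner loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the system itself: the `Dot` fields, the grounding helpers `find`, `is_large`, and `is_dark`, the thresholds, and the response strings are all hypothetical. In particular, `fake_llm_reader` is a keyword-matching stand-in for the LLM call, and `plan` is a hand-written rule set rather than the symbolic planner used in the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dot:
    """Hypothetical dot representation; field names are assumptions."""
    id: int
    x: float
    y: float
    size: float   # 0.0 (small) to 1.0 (large)
    color: float  # 0.0 (light) to 1.0 (dark)


# Hypothetical grounding functions that generated code is allowed to call.
def is_large(dot):
    return dot.size > 0.6


def is_dark(dot):
    return dot.color > 0.6


def find(dots, *predicates):
    """Return the dots that satisfy every predicate."""
    return [d for d in dots if all(p(d) for p in predicates)]


def fake_llm_reader(utterance):
    """Stand-in for the LLM: translate an utterance into a code string."""
    preds = []
    if "large" in utterance or "big" in utterance:
        preds.append("is_large")
    if "dark" in utterance or "black" in utterance:
        preds.append("is_dark")
    return "result = find(" + ", ".join(["dots"] + preds) + ")"


@dataclass
class DialogueState:
    # Each entry pairs the generated code with the dots it resolved to.
    history: list = field(default_factory=list)


def read(utterance, dots, state):
    """Reader: generate code, execute it against the scene, store the output."""
    code = fake_llm_reader(utterance)
    env = {"dots": dots, "find": find, "is_large": is_large, "is_dark": is_dark}
    exec(code, env)  # acceptable here only because the code comes from our own stub
    state.history.append((code, env["result"]))
    return env["result"]


def plan(state):
    """Planner stand-in: choose the next response from the stored state."""
    if not state.history:
        return "Describe a dot you see."
    _, candidates = state.history[-1]
    if not candidates:
        return "I don't see that. Can you describe another dot?"
    if len(candidates) == 1:
        return f"I see it. Let's select dot {candidates[0].id}."
    return f"I see {len(candidates)} candidates. Is it near anything else?"


if __name__ == "__main__":
    scene = [
        Dot(0, 0.1, 0.2, size=0.9, color=0.8),
        Dot(1, 0.5, 0.5, size=0.2, color=0.9),
        Dot(2, 0.7, 0.1, size=0.8, color=0.1),
    ]
    state = DialogueState()
    read("Do you see a large dark dot?", scene, state)
    print(plan(state))
```

Keeping the generated code and its output in `DialogueState` is what makes this kind of design inspectable: each partner utterance leaves behind a program and a concrete set of referents that can be checked by hand.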