Transformer-based chatbots can conduct fluent, natural-sounding conversations, but we have limited understanding of the mechanisms underlying their behavior. Prior work has taken a bottom-up approach to understanding Transformers by constructing Transformers for various synthetic and formal language tasks, such as regular expressions and Dyck languages. However, it is not obvious how to extend this approach to understanding more naturalistic conversational agents. In this work, we take a step in this direction by constructing a Transformer that implements the ELIZA program, a classic, rule-based chatbot. ELIZA illustrates some of the distinctive challenges of the conversational setting, including both local pattern matching and long-term dialog state tracking. We build on constructions from prior work -- in particular, for simulating finite-state automata -- showing how simpler constructions can be composed and extended to give rise to more sophisticated behavior. Next, we train Transformers on a dataset of synthetically generated ELIZA conversations and investigate the mechanisms the models learn. Our analysis illustrates the kinds of mechanisms these models tend to prefer -- for example, models favor an induction-head mechanism over a more precise, position-based copying mechanism, and they use intermediate generations to simulate recurrent data structures, like ELIZA's memory mechanism. Overall, by drawing an explicit connection between neural chatbots and interpretable, symbolic mechanisms, our results offer a new setting for mechanistic analysis of conversational agents.
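To make the two challenges named above concrete, the following is a minimal sketch of an ELIZA-style responder: decomposition patterns handle the local pattern matching, and a small queue stands in for ELIZA's memory, which tracks dialog state across turns. The specific rules, templates, and names here are illustrative assumptions, not taken from Weizenbaum's original script or from the paper's dataset.

```python
import re

# Illustrative ELIZA-style rules: each pairs a decomposition pattern with a
# reassembly template that reuses the captured fragment of the user's input.
RULES = [
    (re.compile(r".*\bI am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r".*\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]

# A toy "memory" queue: fragments saved now can be returned to later,
# giving the dialog long-term state beyond the current turn.
memory: list[str] = []

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        m = pattern.match(utterance)
        if m:
            fragment = m.group(1).rstrip(".!?")
            if "my" in pattern.pattern:
                memory.append(fragment)  # save this topic for later reuse
            return template.format(fragment)
    # No rule fired: fall back to a remembered topic, if one exists.
    if memory:
        return "Earlier you mentioned your {0}. Let's go back to that.".format(memory.pop(0))
    return "Please go on."
```

A Transformer simulating this program must implement both pieces: copying the matched fragment into the response (which trained models tend to do with induction heads rather than positional lookups) and carrying the memory queue forward, which the paper's analysis finds models handle via intermediate generations.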