Intelligent agents such as robots are increasingly deployed in real-world, safety-critical settings. It is vital that these agents can explain the reasoning behind their decisions to their human counterparts; however, their behavior is often produced by uninterpretable models such as deep neural networks. We propose an approach that generates natural language explanations for an agent's behavior based only on observations of states and actions, and is therefore agnostic to the underlying model representation. We show how a compact representation of the agent's behavior can be learned and used to produce plausible explanations with minimal hallucination while affording user interaction with a pre-trained large language model. Through user studies and empirical experiments, we show that our approach generates explanations as helpful as those generated by a human domain expert while enabling beneficial interactions such as clarification and counterfactual queries.