Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic, plausible text. However, current LLM language generation focuses on answering questions and requests accurately with a single effective response. Many real dialogues are instead interactive: an agent's utterances influence its conversational partner, elicit information, or change the partner's opinion. Accounting for how an agent can effectively steer a conversation is a crucial ability in many dialogue tasks, from healthcare to preference elicitation. Existing methods for fine-tuning dialogue agents to accomplish such tasks rely on curating expert data, yet doing so often requires understanding the underlying cognitive processes of the conversational partner, a skill that neither humans nor LLMs trained on human data reliably possess. Our key insight is that while LLMs may not identify effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds. We use this fact to rewrite and augment existing suboptimal data, and then train, via offline reinforcement learning (RL), an agent that outperforms both prompting and learning from unaltered human demonstrations. We apply our approach to two domains that require understanding human mental states, intelligent interaction, and persuasion: mental health support and soliciting charitable donations. In a user study with real humans, our approach substantially outperforms existing state-of-the-art dialogue agents.
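The hindsight-rewriting idea above can be sketched as a data-augmentation step followed by conversion into an offline RL dataset. This is a minimal illustrative sketch, not the paper's implementation: the `relabel` callable stands in for an LLM prompted with the full future of the conversation (information unavailable at generation time but available in hindsight), and the terminal `outcome` stands in for a task reward such as a donation amount or a support rating. All names here (`Turn`, `Episode`, `hindsight_rewrite`, `to_offline_rl_dataset`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Turn:
    speaker: str  # "agent" or "partner"
    text: str

@dataclass
class Episode:
    turns: List[Turn]
    outcome: float  # terminal reward observed at the end of the dialogue

def hindsight_rewrite(episode: Episode,
                      relabel: Callable[[List[Turn], Turn], str]) -> Episode:
    """Rewrite each agent turn post-hoc. The relabeler is shown the
    partner's later responses (hindsight), so it can propose a more
    effective utterance than the original, suboptimal one."""
    new_turns = []
    for i, turn in enumerate(episode.turns):
        if turn.speaker == "agent":
            future = episode.turns[i + 1:]  # context unavailable a priori
            new_turns.append(Turn("agent", relabel(future, turn)))
        else:
            new_turns.append(turn)
    return Episode(new_turns, episode.outcome)

def to_offline_rl_dataset(
        episodes: List[Episode]) -> List[Tuple[List[str], str, float]]:
    """Flatten episodes into (context, action, reward) tuples for an
    offline RL trainer, using the terminal outcome as the reward."""
    data = []
    for ep in episodes:
        context: List[str] = []
        for turn in ep.turns:
            if turn.speaker == "agent":
                data.append((list(context), turn.text, ep.outcome))
            context.append(turn.text)
    return data
```

In practice the relabeler would be an LLM prompt of the form "given how the partner responded, rewrite this utterance to steer the conversation more effectively", and the resulting tuples would feed a standard offline RL objective; the stub structure above only fixes the data flow.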