Task-oriented dialog systems enable users to accomplish tasks using natural language. State-of-the-art systems respond to users in the same way regardless of their personalities, although personalizing dialogues can lead to higher levels of adoption and better user experiences. Building personalized dialog systems is an important yet challenging endeavor, and only a handful of works have taken on the challenge. Most existing works rely on supervised learning and require labeled training data for each user profile, which is laborious and expensive to produce; collecting and labeling such data for every user profile is virtually impossible. In this work, we propose a novel framework, P-ToD, for personalizing task-oriented dialog systems so that they can adapt to a wide range of user profiles in an unsupervised fashion using a zero-shot generalizable reward function. P-ToD uses a pre-trained GPT-2 as its backbone model and works in three phases. Phase one performs task-specific training. Phase two performs unsupervised personalization using the proximal policy optimization algorithm, whose policy-gradient updates are guided by the zero-shot generalizable reward function. Our novel reward function can quantify the quality of the generated responses even for unseen profiles. The optional final phase fine-tunes the personalized model using a few labeled training examples. We conduct extensive experimental analysis using the personalized bAbI dialogue benchmark for five tasks and up to 180 diverse user profiles. The experimental results demonstrate that P-ToD, even with access to zero labeled examples, outperforms state-of-the-art supervised personalization models and achieves competitive performance on BLEU and ROUGE metrics when compared to a strong fully-supervised GPT-2 baseline.
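As an illustration of the policy-gradient step in phase two, the following is a minimal sketch of the standard clipped surrogate objective from proximal policy optimization, written in plain Python. It is not the P-ToD implementation: the function name, the clipping value of 0.2, and the example inputs are assumptions made for illustration, and the advantages stand in for whatever signal is derived from the reward function, which the abstract does not specify.

```python
import math


def ppo_clipped_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch of sampled actions.

    new_logprobs / old_logprobs: log-probabilities of the sampled actions
    under the current policy and the policy that generated the samples.
    advantages: advantage estimates (assumed here to come from a reward
    signal; how P-ToD computes them is not stated in the abstract).
    """
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("inputs must have equal length")
    if not advantages:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The objective is maximized during training. Clipping removes the incentive to move the policy far from the one that produced the samples when the advantage is positive, while the minimum keeps the full penalty when the advantage is negative.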