Recent work usually addresses dialog policy learning (DPL) by training a reinforcement learning (RL) agent to determine the best dialog action. However, existing deep RL approaches require a large volume of agent-user interactions to achieve acceptable performance. In this paper, we propose to make full use of the plain-text knowledge in a pre-trained language model to accelerate the RL agent's learning. Specifically, we design a dialog action-aware transformer encoder (DaTrans), which integrates a new fine-tuning procedure, the masked last action task, to encourage DaTrans to be dialog-aware and to distill action-specific features. DaTrans is then further optimized in an RL setting with ongoing interactions, evolving through exploration in the dialog action space toward maximizing long-term accumulated rewards. The effectiveness and efficiency of the proposed model are demonstrated with both simulator evaluation and human evaluation.
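
To make the masked last action task more concrete, the following is a minimal sketch of how one training example might be constructed. The abstract does not specify the input format, so this is only one plausible reading: the last dialog action in a history is replaced by a mask token, and the model is asked to recover it. The names `MASK_TOKEN`, `ACTION_VOCAB`, and `build_masked_last_action_example`, as well as the action strings, are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# Hypothetical placeholder token; the actual token used by DaTrans is not stated in the abstract.
MASK_TOKEN = "[MASK]"

# Hypothetical toy action vocabulary for illustration.
ACTION_VOCAB = [
    "greet",
    "request(area)",
    "inform(area)",
    "request(food)",
    "inform(food)",
    "offer(name)",
]


def build_masked_last_action_example(
    actions: List[str], action_vocab: List[str]
) -> Tuple[List[str], int]:
    """Replace the last action with MASK_TOKEN and return (inputs, label id).

    The label is the index of the original last action in `action_vocab`,
    i.e. the class a classifier head would be trained to predict.
    """
    if not actions:
        raise ValueError("dialog must contain at least one action")
    label_id = action_vocab.index(actions[-1])  # raises ValueError if unknown
    masked_inputs = actions[:-1] + [MASK_TOKEN]
    return masked_inputs, label_id


# Toy dialog history expressed as a sequence of dialog actions.
dialog_actions = ["greet", "request(area)", "inform(area)", "offer(name)"]
inputs, label = build_masked_last_action_example(dialog_actions, ACTION_VOCAB)
```

In this sketch, `inputs` would be fed to the encoder and `label` would serve as the prediction target for the masked position. How the encoder consumes the sequence and how the objective is combined with later RL optimization are not covered here.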