A task-oriented dialogue (TOD) system is designed to accomplish user-defined tasks through dialogue. TOD systems have progressed toward end-to-end modeling by leveraging pre-trained large language models. However, fine-tuning pre-trained language models with supervised learning alone leads to exposure bias and the token-loss problem, causing the models to deviate from completing the user's task. To address these issues, we propose a TOD system that uses a unified pre-trained language model, GPT2, as its base model and is optimized with both supervised learning and reinforcement learning (RL). The issues above are mitigated by a non-differentiable reward function, computed as a weighted sum of the success rate and the BLEU evaluation metric. In the reward, the success rate guides the language model toward completing the user's task, while BLEU encourages coherent and fluent responses. Our model is obtained by fine-tuning the pre-trained model at the dialogue-session level, where each session comprises user utterances, belief states, system acts, and system responses. Experimental results on MultiWOZ2.1 demonstrate that our model improves the inform rate by 1.60% and the success rate by 3.17% over the baseline.
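The weighted-sum reward described above can be sketched as follows. This is a minimal illustration: the weight values and the function name are assumptions, since the abstract does not specify the actual weighting used in the paper.

```python
def dialogue_reward(success_rate: float, bleu: float,
                    w_success: float = 0.5, w_bleu: float = 0.5) -> float:
    """Non-differentiable reward for RL fine-tuning of the TOD model.

    success_rate: fraction of user goals completed in the dialogue (0..1).
    bleu: BLEU score of the generated system responses (0..1).
    w_success, w_bleu: illustrative weights; the paper's values are not
    given in this abstract.
    """
    return w_success * success_rate + w_bleu * bleu
```

Because the reward depends on session-level metrics (success rate, BLEU) rather than per-token likelihoods, it cannot be backpropagated directly, which is why a policy-gradient-style RL objective is needed alongside the supervised loss.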