Recently, two approaches, fine-tuning large pre-trained language models and variational training, have each attracted significant interest for semi-supervised end-to-end task-oriented dialog (TOD) systems. In this paper, we propose the Variational Latent-State GPT model (VLS-GPT), which is the first to combine the strengths of the two approaches. Among the many possible model choices, we propose a generative model and an inference model for variational learning of the end-to-end TOD system, both auto-regressive language models based on GPT-2, which can be further trained on a mix of labeled and unlabeled dialog data in a semi-supervised manner. Variational training of VLS-GPT is both statistically and computationally more challenging than in previous variational learning work on sequential latent variable models, which relies on a turn-level first-order Markovian assumption. The inference model in VLS-GPT is non-Markovian due to the use of the Transformer architecture. In this work, we establish the Recursive Monte Carlo Approximation (RMCA) to the variational objective with a non-Markovian inference model and prove its unbiasedness. Further, we develop the computational strategy of sampling-then-forward-computation to realize RMCA, which overcomes the memory explosion issue of using GPT in variational learning and speeds up training. Semi-supervised TOD experiments are conducted on two benchmark multi-domain datasets in different languages, MultiWOZ2.1 and CrossWOZ. VLS-GPT is shown to significantly outperform both supervised-only and semi-supervised self-training baselines.
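The idea behind a recursive Monte Carlo approximation with a non-Markovian inference model can be illustrated with a toy sketch. This is not the authors' implementation and does not involve GPT-2 or the sampling-then-forward-computation strategy. It only assumes binary latent states, a hypothetical proposal `q_prob_one` that conditions on the whole sampled history, and a made-up per-turn score `turn_term`. Under those assumptions, sampling each turn's latent state conditioned on the earlier samples and summing the per-turn terms should give an estimate whose average approaches the exact expectation.

```python
import itertools
import math
import random

T = 3  # number of dialog turns in this toy example (hypothetical value)


def q_prob_one(history):
    """Toy non-Markovian proposal: P(h_t = 1 | h_1..h_{t-1}) uses the full history."""
    logit = 0.3 + 0.8 * sum(history) - 0.5 * len(history)
    return 1.0 / (1.0 + math.exp(-logit))


def turn_term(history, h_t):
    """Stand-in for a per-turn objective term, log p(...) - log q(...)."""
    p1 = q_prob_one(history)
    log_q = math.log(p1 if h_t == 1 else 1.0 - p1)
    log_p = -0.5 * (h_t - 0.2 * sum(history)) ** 2  # arbitrary toy score
    return log_p - log_q


def exact_objective():
    """Exact expectation, obtained by enumerating all 2**T latent paths."""
    total = 0.0
    for path in itertools.product((0, 1), repeat=T):
        weight, value = 1.0, 0.0
        for t, h_t in enumerate(path):
            history = path[:t]
            p1 = q_prob_one(history)
            weight *= p1 if h_t == 1 else 1.0 - p1
            value += turn_term(history, h_t)
        total += weight * value
    return total


def recursive_mc_estimate(rng):
    """One-sample estimate: draw h_t turn by turn, conditioned on earlier draws."""
    history, value = [], 0.0
    for _ in range(T):
        h_t = 1 if rng.random() < q_prob_one(history) else 0
        value += turn_term(history, h_t)
        history.append(h_t)
    return value


def average_estimate(num_samples, seed=0):
    """Average many one-sample estimates using a seeded generator."""
    rng = random.Random(seed)
    return sum(recursive_mc_estimate(rng) for _ in range(num_samples)) / num_samples
```

Comparing `average_estimate(20000)` with `exact_objective()` gives a quick empirical check that the turn-by-turn estimator is consistent with the enumerated expectation in this toy setting.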