Non-autoregressive models have been widely perceived as superior in generation efficiency but inferior in generation quality, owing to the difficulty of modeling multiple target modalities. To enhance multi-modality modeling ability, we propose the Diffusion Glancing Transformer (DIFFGLAT), which employs a modality diffusion process and residual glancing sampling. The modality diffusion process is a discrete process that interpolates the multi-modal distribution along the decoding steps, and residual glancing sampling guides the model to continuously learn the remaining modalities across the layers. Experimental results on various machine translation and text generation benchmarks demonstrate that, compared with both autoregressive and non-autoregressive models, DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed.