Text generation models are notoriously vulnerable to errors in the training data. As massive amounts of web-crawled data become widely available, how can we enhance the robustness of models trained on such noisy text? We propose Error Norm Truncation (ENT), a robust enhancement to the standard training objective that truncates noisy data. Compared to methods that use only the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimate by considering the distribution of non-target tokens, which previous work often overlooks. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and over previous soft and hard truncation methods. Furthermore, we show that our method improves the robustness of models against two of the most detrimental types of noise in machine translation, resulting in an increase of more than 2 BLEU points over the MLE baseline when up to 50% noise is added to the data.
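The contrast between a loss-only quality estimate and one that also considers non-target tokens can be illustrated with a minimal sketch. It rests on assumptions the abstract does not spell out: the error norm is taken to be the L2 distance between the predicted distribution and the one-hot target, truncation is applied per token, and the function names and the `threshold` value are hypothetical. In the example, two tokens assign the same probability (0.4) to the target, so their negative log-likelihood is identical, yet their error norms differ because the remaining probability mass is spread differently.

```python
import numpy as np

def error_norm(probs, targets):
    """L2 distance between each predicted distribution and its one-hot target.

    Assumed definition for illustration; the abstract does not state it.
    """
    one_hot = np.eye(probs.shape[-1])[targets]
    return np.linalg.norm(probs - one_hot, axis=-1)

def truncated_nll(probs, targets, threshold):
    """Mean NLL over tokens whose error norm is at most `threshold`.

    Tokens above the (hypothetical) threshold are treated as noisy and
    dropped from the loss.
    """
    nll = -np.log(probs[np.arange(len(targets)), targets])
    keep = error_norm(probs, targets) <= threshold
    loss = (nll * keep).sum() / max(keep.sum(), 1)
    return loss, keep

# Two tokens, both with target index 0 and target probability 0.4.
probs = np.array([
    [0.4, 0.6, 0.0, 0.0],   # the rest of the mass sits on one wrong token
    [0.4, 0.2, 0.2, 0.2],   # the rest of the mass is spread out
])
targets = np.array([0, 0])

norms = error_norm(probs, targets)          # approx. [0.849, 0.693]
loss, keep = truncated_nll(probs, targets, threshold=0.8)
```

With the illustrative threshold of 0.8, only the second token is kept, even though a loss-only criterion could not tell the two apart.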