For nearly three decades, language models derived from the $n$-gram assumption held the state of the art in language modeling. The key to their success lay in the application of various smoothing techniques that combat overfitting. However, when neural language models toppled $n$-gram models as the best performers, $n$-gram smoothing techniques became less relevant. Indeed, it would hardly be an overstatement to suggest that the line of inquiry into $n$-gram smoothing techniques became dormant. This paper re-opens the question of what role classical $n$-gram smoothing techniques may play in the age of neural language models. First, we draw a formal equivalence between label smoothing, a popular regularization technique for neural language models, and add-$\lambda$ smoothing. Second, we derive a generalized framework for converting any $n$-gram smoothing technique into a regularizer compatible with neural language models. Empirically, we find that these novel regularizers are comparable to, and sometimes outperform, label smoothing on language modeling and machine translation.
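To make the claimed equivalence concrete, here is a minimal worked sketch; the notation is illustrative and not taken from the abstract ($V$ is the vocabulary, $\boldsymbol{c}$ an $n$-gram context, $\operatorname{count}(\cdot)$ a corpus count, and $y^{\star}$ the observed next word). Add-$\lambda$ smoothing adds a pseudo-count $\lambda$ to every count, and simple algebra rewrites the result as a mixture of the empirical distribution with the uniform distribution over $V$, which is exactly the form a label-smoothed target takes:
\begin{align*}
p_{\lambda}(y \mid \boldsymbol{c})
  &= \frac{\operatorname{count}(\boldsymbol{c}, y) + \lambda}{\operatorname{count}(\boldsymbol{c}) + \lambda\,|V|} \\
  &= (1 - \epsilon_{\boldsymbol{c}})\,\frac{\operatorname{count}(\boldsymbol{c}, y)}{\operatorname{count}(\boldsymbol{c})}
   + \epsilon_{\boldsymbol{c}}\,\frac{1}{|V|},
  \qquad \epsilon_{\boldsymbol{c}} = \frac{\lambda\,|V|}{\operatorname{count}(\boldsymbol{c}) + \lambda\,|V|},
\end{align*}
whereas label smoothing trains against the target $(1 - \epsilon)\,\mathbb{1}\{y = y^{\star}\} + \epsilon/|V|$ for a fixed $\epsilon$. Since one-hot targets average, over the occurrences of a context $\boldsymbol{c}$, to the empirical distribution $\operatorname{count}(\boldsymbol{c}, y)/\operatorname{count}(\boldsymbol{c})$, the two coincide when $\epsilon$ is matched to $\epsilon_{\boldsymbol{c}}$. This is only a plausibility sketch; the paper's formal statement of the equivalence is presumably more precise.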