Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, which makes it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volume across languages carries over into the vocabulary, which degrades translation quality for low-resource languages. Byte-based tokenization addresses both issues, but byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous work enhances token semantics through local contextualization but fails to select an appropriate contextualizing scope based on the input. We therefore propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information at varying scales across different hidden state dimensions and then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code is available at https://github.com/ictnlp/Multiscale-Contextualization.
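
To make the two central ideas concrete, the following is a minimal Python sketch and not the authors' implementation (that lives in the repository linked above). It first shows why UTF-8 byte sequences have low information density: the vocabulary is fixed at 256 byte values, but one character can expand into several tokens. It then illustrates the general idea of contextualizing different hidden state dimensions at different scales. The function names `byte_tokenize`, `local_average`, and `multi_scale_contextualize` are hypothetical, and the fixed causal averaging window is an assumption standing in for whatever learned local layers the method actually uses.

```python
def byte_tokenize(text: str) -> list[int]:
    """Map text to UTF-8 byte IDs; the vocabulary is always the 256 byte values."""
    return list(text.encode("utf-8"))


def local_average(seq: list[list[float]], dim: int, width: int) -> list[float]:
    """Average one hidden dimension over the last `width` positions.

    Assumption: a fixed causal mean replaces a learned local layer,
    purely for illustration. width=1 leaves the dimension unchanged.
    """
    out = []
    for t in range(len(seq)):
        window = [seq[i][dim] for i in range(max(0, t - width + 1), t + 1)]
        out.append(sum(window) / len(window))
    return out


def multi_scale_contextualize(
    seq: list[list[float]], widths: list[int]
) -> list[list[float]]:
    """Split hidden dimensions into len(widths) groups; group g uses widths[g].

    `seq` is indexed as [position][dimension]. The output has the same shape,
    so a later attention layer could weigh the differently scaled dimensions.
    """
    n_dims = len(seq[0])
    group_size = max(1, n_dims // len(widths))
    columns = []
    for d in range(n_dims):
        g = min(d // group_size, len(widths) - 1)
        columns.append(local_average(seq, d, widths[g]))
    # Transpose from [dimension][position] back to [position][dimension].
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Characters vs. byte tokens for the same word in different scripts.
    for word in ["cat", "кот", "猫"]:
        print(word, len(word), len(byte_tokenize(word)))

    # Two hidden dimensions: the first keeps scale 1, the second uses scale 2.
    hidden = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
    print(multi_scale_contextualize(hidden, widths=[1, 2]))
```

In this sketch the single character "猫" becomes three byte tokens and each Cyrillic letter becomes two, while ASCII letters stay at one token each, so the same amount of meaning is spread over more positions in some scripts than in others. Giving each group of dimensions its own window width is one simple way to expose several context scales at once without committing to a single scope for every input.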