Subword tokenization is a common method for building vocabularies in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volume across languages carries over to the vocabulary, degrading translation quality for low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous work enhances token semantics through local contextualization but fails to select an appropriate contextualizing scope based on the input. We therefore propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions and then leverages the attention module to dynamically integrate this multi-scale information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code is available at https://github.com/ictnlp/Multiscale-Contextualization.
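The two ideas in the abstract, byte-level tokenization and contextualization at several scales across hidden state dimensions, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `byte_tokenize` and `multi_scale_contextualize` are hypothetical, the even split of dimensions into groups is an assumption, and a fixed mean over a centered window stands in for whatever learned contextualization the actual model uses. The dynamic integration step performed by the attention module is not modeled here.

```python
def byte_tokenize(text):
    """Map text to UTF-8 byte IDs (fixed vocabulary of 256, nothing learned)."""
    return list(text.encode("utf-8"))


def multi_scale_contextualize(hidden, window_sizes):
    """Toy multi-scale contextualization (illustrative assumption only).

    hidden: seq_len x d list of lists of floats.
    window_sizes: one odd window size per dimension group; the d dimensions
    are split evenly into len(window_sizes) groups. Each group is replaced
    by the mean over a centered window of that size, truncated at the
    sequence edges. A window of 1 leaves the group unchanged.
    """
    seq_len, d = len(hidden), len(hidden[0])
    n_groups = len(window_sizes)
    if d % n_groups != 0:
        raise ValueError("hidden size must be divisible by the number of scales")
    group = d // n_groups
    out = [[0.0] * d for _ in range(seq_len)]
    for g, k in enumerate(window_sizes):
        half = k // 2
        for t in range(seq_len):
            lo, hi = max(0, t - half), min(seq_len, t + half + 1)
            for j in range(g * group, (g + 1) * group):
                out[t][j] = sum(hidden[s][j] for s in range(lo, hi)) / (hi - lo)
    return out


# One CJK character becomes three byte tokens, so a byte sequence carries
# less information per token than a character or subword sequence.
ids = byte_tokenize("猫")
print(len(ids))  # 3

# Toy "embeddings": two dimensions per position, contextualized at scales 1 and 3.
hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(multi_scale_contextualize(hidden, window_sizes=[1, 3]))
```

In this sketch the first dimension keeps each position's own value while the second mixes in its neighbors, so a later layer has both a local and a wider view of every byte available to weigh against each other.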