Subword tokenization is the de facto standard in neural language models and machine translation systems. Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and the ability to deal with unknown words. As their relative importance is not yet entirely clear, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality. The approach uses Huffman coding to tokenize words, in order of frequency, using a fixed number of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for 90%-95% of the scores reached by BPE; hence compositionality has less importance than previously thought.