Large Language Models have proven highly successful at modelling a variety of tasks. However, this success comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWT produces a more compact and efficient tokenization that yields two benefits: (1) an increase in performance, due to greater coverage of the input data under a fixed sequence length budget; and (2) faster and lighter inference, due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust at shorter sequence lengths, thus allowing for major speedups via early sequence truncation.
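To make the core idea concrete, the following is a minimal sketch of multi-word tokenization, not the implementation used in this paper. It assumes whitespace-separated words, bigrams only, selection by raw frequency, and greedy left-to-right merging; the helper names (`top_ngrams`, `mwt_tokenize`, `covered_words`) and the toy corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def top_ngrams(corpus, n=2, k=3):
    """Return the k most frequent word n-grams in a whitespace-split corpus."""
    counts = Counter()
    for text in corpus:
        words = text.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram for gram, _ in counts.most_common(k)}


def mwt_tokenize(text, ngrams, n=2):
    """Greedily merge known n-grams, left to right, into single tokens."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        gram = tuple(words[i:i + n])
        if len(gram) == n and gram in ngrams:
            tokens.append("_".join(gram))  # one token spans n words
            i += n
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def covered_words(tokens, budget):
    """Count how many original words fit into the first `budget` tokens."""
    return sum(len(tok.split("_")) for tok in tokens[:budget])


# Toy corpus (an assumption for illustration, not data from the paper).
corpus = [
    "new york is in the united states",
    "she moved to new york last year",
    "the united states has many cities",
]

ngrams = top_ngrams(corpus, n=2, k=3)
baseline = corpus[0].split()
merged = mwt_tokenize(corpus[0], ngrams)

print(len(baseline), baseline)
print(len(merged), merged)
print(covered_words(baseline, 4), covered_words(merged, 4))
```

In this toy setting the merged sequence is shorter than the word-level one, so truncating both to the same token budget leaves more of the original input inside the merged sequence. This mirrors, in miniature, the coverage argument behind benefit (1) and the truncation argument behind benefit (2).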