Large Language Models have proven highly successful at modelling a variety of tasks. However, this success comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) increased performance, due to greater coverage of the input data given a fixed sequence length and budget; (2) faster and lighter inference, due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust at shorter sequence lengths, thus allowing for major speedups via early sequence truncation.
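To make the core idea concrete, the following is a minimal sketch of how frequent multi-word expressions might be merged into single tokens. It is an illustration under simplifying assumptions, not the implementation used in this work: it splits on whitespace, considers only adjacent word pairs (bigrams), and joins merged words with an underscore. The function names `learn_multiword_tokens` and `tokenize` and the parameter `top_k` are hypothetical.

```python
from collections import Counter


def learn_multiword_tokens(corpus, top_k):
    """Return the top_k most frequent adjacent word pairs in the corpus.

    Assumption: whitespace splitting stands in for a real word tokenizer.
    """
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        # Count every pair of neighbouring words.
        counts.update(zip(words, words[1:]))
    return {pair for pair, _ in counts.most_common(top_k)}


def tokenize(sentence, multiword_tokens):
    """Greedily merge known word pairs, left to right, into single tokens."""
    words = sentence.split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in multiword_tokens:
            # Emit the pair as one token and skip both words.
            tokens.append("_".join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


corpus = [
    "new york is big",
    "i love new york",
    "new york never sleeps",
]
mwt_vocab = learn_multiword_tokens(corpus, top_k=1)
print(tokenize("i love new york", mwt_vocab))
```

In this toy corpus the sentence shrinks from four word tokens to three, so a fixed sequence length covers more of the input. That is the source of both benefits described above.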