Large language models (LLMs) have dramatically changed the prospects of AI by enabling more sophisticated natural language processing. However, current methods for training LLMs demand extensive resources, including large amounts of data, expensive hardware, and long training times. To address this problem, this paper proposes MultiTok, a new tokenization method inspired by universal Lempel-Ziv-Welch (LZW) data compression that compresses repetitive phrases into multi-word tokens. Using MultiTok as the tokenizer, we show that language models can be trained notably more efficiently on more succinct, compressed training data while retaining similar accuracy. Our results demonstrate that MultiTok achieves performance comparable to the BERT and GPT standards, both as a stand-alone tokenizer and as an add-on to existing tokenizers, while providing close to 2.5x faster training with more than 30% less training data.
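As a rough illustration of the idea summarized in the abstract, the following Python sketch applies a textbook LZW-style dictionary build to a sequence of words rather than characters, so that repeated phrases collapse into single multi-word tokens. It is not the authors' MultiTok implementation: the function names `lzw_multiword_encode` and `lzw_multiword_decode`, the whitespace word splitting, and the absence of any vocabulary-size limit or frequency threshold are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


def lzw_multiword_encode(words: List[str]) -> Tuple[List[int], Dict[Phrase, int]]:
    """Encode a word sequence with an LZW-style, growing phrase dictionary.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Seed the dictionary with each distinct single word, in order of first appearance.
    vocab: Dict[Phrase, int] = {}
    for w in words:
        if (w,) not in vocab:
            vocab[(w,)] = len(vocab)

    tokens: List[int] = []
    phrase: Phrase = ()
    for w in words:
        candidate = phrase + (w,)
        if candidate in vocab:
            # Keep extending the current phrase while it is already known.
            phrase = candidate
        else:
            # Emit the longest known phrase, then register the extended one.
            tokens.append(vocab[phrase])
            vocab[candidate] = len(vocab)
            phrase = (w,)
    if phrase:
        # Flush the final pending phrase.
        tokens.append(vocab[phrase])
    return tokens, vocab


def lzw_multiword_decode(tokens: List[int], vocab: Dict[Phrase, int]) -> List[str]:
    """Expand token ids back into words using the dictionary built by the encoder."""
    inverse = {idx: phrase for phrase, idx in vocab.items()}
    return [w for t in tokens for w in inverse[t]]


if __name__ == "__main__":
    text = "the cat sat on the cat sat on the mat"
    words = text.split()
    tokens, vocab = lzw_multiword_encode(words)
    print(len(words), "words ->", len(tokens), "tokens")
    print(lzw_multiword_decode(tokens, vocab) == words)
```

On this toy sentence, the second occurrences of "the cat" and "sat on" are each emitted as one token, which is the source of the shorter training sequences the abstract refers to. A practical tokenizer would also need to bound the dictionary size and decide how to handle unseen words, both of which this sketch leaves out.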