Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
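To make the idea of sparse activation patterns over character triplets concrete, the following is a minimal sketch, not the reference implementation. It pads a word with whitespace, splits it into character trigrams, and hashes each trigram to a few rows of an embedding table. The function names, the use of SHA-256, the table size of 8000 rows, and the four hashes per trigram are all illustrative assumptions.

```python
import hashlib


def trigrams(word: str) -> list[str]:
    """Split a word, padded with one space on each side, into character triplets."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_rows(word: str, table_size: int = 8000, num_hashes: int = 4) -> set[int]:
    """Return the embedding-table rows activated by a word.

    Assumption: each trigram is hashed `num_hashes` times with a seeded
    SHA-256 digest; `table_size` and `num_hashes` are hypothetical values.
    """
    rows = set()
    for tri in trigrams(word):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{tri}".encode("utf-8")).digest()
            rows.add(int.from_bytes(digest[:8], "big") % table_size)
    return rows


def embed(word: str, table: list[list[float]]) -> list[float]:
    """Embed a word as the sum of its activated rows in `table`."""
    rows = active_rows(word, table_size=len(table))
    dim = len(table[0])
    return [sum(table[r][d] for r in rows) for d in range(dim)]


# Morphologically related words share trigrams, hence activated rows.
shared = active_rows("word") & active_rows("words")
```

Because "word" and "words" share the trigrams " wo", "wor", and "ord", their activation patterns overlap. This is the sense in which such a scheme can exploit morphological similarity without a vocabulary learned from a reference corpus.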