Tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) have significantly improved the computational efficiency and vocabulary representation stability of large language models (LLMs) by segmenting text into tokens. However, this segmentation often obscures the internal character structures and sequences within tokens, preventing models from fully learning these intricate details during training. Consequently, LLMs struggle to comprehend the character compositions and positional relationships within tokens, especially when fine-tuned on downstream tasks with limited data. In this paper, we introduce Token Internal Position Awareness (TIPA), a novel approach that enhances LLMs' understanding of internal token structures by training them on reverse character prediction tasks using the tokenizer's own vocabulary. This method enables models to effectively learn and generalize character positions and internal structures. Experimental results demonstrate that LLMs trained with TIPA outperform baseline models in predicting character positions at the token level. Furthermore, when applied to the downstream task of Chinese Spelling Correction (CSC), TIPA not only accelerates model convergence but also significantly improves task performance.
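The reverse character prediction objective described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the exact target format are hypothetical, not the paper's implementation. For each multi-character token in a tokenizer's vocabulary, the model is trained to recover each character together with its position counted from the end of the token:

```python
# Hedged sketch of TIPA-style training-data construction.
# Assumption: targets map position-from-end -> character; the paper's
# actual serialization of these targets may differ.

def tipa_target(token: str) -> dict:
    """Reverse character-prediction target for one token.

    Example: 'abc' -> {1: 'c', 2: 'b', 3: 'a'}
    (position 1 is the last character).
    """
    return {i: ch for i, ch in enumerate(reversed(token), start=1)}


def build_tipa_dataset(vocab):
    """Build (token, target) pairs from a tokenizer's vocabulary.

    Single-character tokens carry no internal ordering information,
    so they are skipped here (an assumption of this sketch).
    """
    return [(tok, tipa_target(tok)) for tok in vocab if len(tok) > 1]


# Toy vocabulary standing in for a real BPE/BBPE tokenizer's vocab.
vocab = ["拼写", "hello", "a"]
dataset = build_tipa_dataset(vocab)
```

Because the targets are derived purely from the tokenizer's own vocabulary, no external annotated data is needed for this auxiliary task.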