Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure the internal character structure of tokens. This limitation hinders LLMs' ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC), where identifying the positions of misspelled characters accelerates correction. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models' ability to capture character positions within tokens by training them on reverse character prediction tasks built from the tokenizer's vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in the original text. Furthermore, even when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance wherever character-level information is needed, validating its versatility and effectiveness.
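To make the training setup concrete, the reverse character prediction task described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pair format (reversed `position:character` pairs) and the helper name `build_tipa_example` are assumptions for exposition, and a small hand-written list stands in for a real tokenizer vocabulary.

```python
def build_tipa_example(token: str) -> dict:
    """Build one reverse character prediction example for a token.

    Hypothetical format: the target lists the token's characters in
    reverse order, each paired with its 1-based position, so the model
    must recover internal character positions from the token alone.
    """
    pairs = [(i + 1, ch) for i, ch in enumerate(token)][::-1]
    target = ", ".join(f"{pos}:{ch}" for pos, ch in pairs)
    return {"input": token, "target": target}

# Stand-in for a tokenizer vocabulary; multi-character tokens are the
# interesting cases, since single characters carry no internal structure.
vocab = ["hello", "world", "a"]
dataset = [build_tipa_example(t) for t in vocab if len(t) > 1]
print(dataset[0])
# {'input': 'hello', 'target': '5:o, 4:l, 3:l, 2:e, 1:h'}
```

Training on such pairs exposes the model to the character-level composition of every vocabulary token, which is otherwise hidden once BPE merges characters into opaque units.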