Generative models are widely used, yet they often struggle when a prompt ends in a partial token. The difficulty stems from tokenization: a partial token at the end of the prompt is out of distribution at inference time, which leads to incorrect or nonsensical outputs. This paper examines a technique that alleviates this tokenization artifact in text completion while maintaining performance in regular, non-subword cases. The method, termed token alignment, backtracks to the last complete tokens and ensures that the model's generation aligns with the prompt. It yields marked improvement across many partial-token scenarios, including nuanced cases such as space prefixes and partial indentation, with only a minor increase in inference time. The technique and analysis presented here contribute to the continued advancement of generative models in handling partial inputs, and are relevant to applications such as code completion and text autocompletion.
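The backtrack-and-align idea can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's implementation: the vocabulary `VOCAB`, the greedy longest-match `tokenize`, the scoring function `toy_score`, and the `backtrack` parameter are all hypothetical stand-ins for a real subword tokenizer and language model.

```python
from typing import Callable, List

# Hypothetical toy vocabulary; a real subword vocabulary is far larger.
VOCAB: List[str] = ["r", "e", "t", "re", "ret", "return", " x"]


def tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenization over VOCAB (a stand-in for BPE)."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        matches = [t for t in VOCAB if text.startswith(t, i)]
        if not matches:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
        best = max(matches, key=len)
        tokens.append(best)
        i += len(best)
    return tokens


def is_aligned(token: str, pending: str) -> bool:
    """A token agrees with the pending prompt text if one is a prefix of the other."""
    return token.startswith(pending) or pending.startswith(token)


def toy_score(context: List[str], token: str) -> float:
    """Hypothetical stand-in for a model's next-token score."""
    if context and context[-1] == "return":
        return 1.0 if token == " x" else 0.0
    return 1.0 if token == "return" else 0.0


def complete(prompt: str,
             score: Callable[[List[str], str], float],
             backtrack: int = 1,
             max_new_tokens: int = 2) -> str:
    """Greedy completion that re-generates the last `backtrack` prompt tokens."""
    tokens = tokenize(prompt)
    keep = tokens[:max(len(tokens) - backtrack, 0)]
    # Text of the dropped tokens; generation must reproduce it character by character.
    pending = "".join(tokens[len(keep):])
    context = list(keep)
    for _ in range(max_new_tokens):
        # While prompt text is still pending, only aligned tokens are allowed.
        candidates = [t for t in VOCAB if not pending or is_aligned(t, pending)]
        best = max(candidates, key=lambda t: score(context, t))
        context.append(best)
        # Consume the matched part; empty once the token extends past the prompt.
        pending = pending[len(best):]
    return "".join(context)


aligned = complete("ret", toy_score, backtrack=1)    # "return x"
unaligned = complete("ret", toy_score, backtrack=0)  # "retreturn x"
```

With `backtrack=0` the toy model continues after the partial token `ret`, which it has no sensible continuation for, and produces `retreturn x`. With `backtrack=1` the partial token is dropped, the first generated token is restricted to those consistent with the pending text `ret`, and the completion becomes `return x`. A production version would apply the same mask to model logits rather than to a hand-written score.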