With the expanding application of Large Language Models (LLMs) across domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and the consequent outcomes. In this study, we introduce and systematically explore the phenomenon of "glitch tokens": anomalous tokens produced by established tokenizers that can compromise the quality of a model's responses. Specifically, we experiment on seven popular LLMs built on three distinct tokenizers, covering a total of 182,517 tokens. We present a categorization of the identified glitch tokens and of the symptoms LLMs exhibit when interacting with them. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique for efficient glitch token detection. Our evaluation shows that this approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study of glitch tokens, and our detection technique provides valuable insights into mitigating tokenization-related errors in LLMs.
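To make the clustering intuition concrete, below is a minimal sketch, not the authors' GlitchHunter implementation: it substitutes plain KMeans for the paper's iterative clustering and uses a simple repetition probe as the glitch test. The model name `gpt2`, the cluster count, the sample size per cluster, and the probe prompt are all illustrative assumptions.

```python
# Sketch of the core intuition only: glitch tokens cluster in embedding space,
# so probing a few tokens per cluster is far cheaper than probing the whole
# vocabulary. Not the paper's algorithm; KMeans stands in for its iterative
# clustering, and the repetition probe is a common heuristic, not its test.
import random

import torch
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # illustrative stand-in for the open-source LLMs evaluated

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Token embedding matrix: one row per vocabulary entry.
emb = model.get_input_embeddings().weight.detach().cpu().numpy()

# Cluster the vocabulary; glitch tokens are expected to concentrate
# in a small number of clusters.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(emb)

def probe(token_id: int) -> bool:
    """Repetition test: ask the model to repeat a token; if the token never
    appears in the reply, flag it as a candidate glitch token."""
    token = tokenizer.decode([token_id])
    prompt = f"Please repeat the string '{token}' back to me."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    reply = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
    return token.strip() != "" and token not in reply

# Probe a small sample from each cluster instead of every token.
candidates = []
for c in range(kmeans.n_clusters):
    members = [i for i, label in enumerate(kmeans.labels_) if label == c]
    for tid in random.sample(members, min(3, len(members))):
        if probe(tid):
            candidates.append(tid)

print(f"{len(candidates)} candidate glitch tokens found")
```

In the paper's setting, clusters whose sampled members fail the probe would then be examined more densely, which is what makes the cluster-then-probe strategy efficient relative to exhaustively querying all 182,517 tokens.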