As Large Language Models (LLMs) are applied across more and more domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and the resulting outcomes. In this study, we introduce and systematically explore the phenomenon of "glitch tokens": anomalous tokens produced by established tokenizers that could compromise the quality of a model's responses. Specifically, we experiment on seven popular LLMs that use three distinct tokenizers, covering a total of 182,517 tokens. We present categorizations of the identified glitch tokens and of the symptoms LLMs exhibit when interacting with them. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, this is the first comprehensive study of glitch tokens. Our new detection method further provides valuable insights into mitigating tokenization-related errors in LLMs.
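To make the clustering intuition concrete, the following is a minimal sketch of the general idea only; it is not the GlitchHunter algorithm. It assumes toy 2-D embeddings, a hypothetical `is_glitch` oracle standing in for an actual model query, and a simple greedy "leader" clustering chosen for brevity. All names, data, and parameters here are illustrative assumptions.

```python
import math

def leader_clusters(tokens, emb, radius):
    """Greedy clustering: a token joins the first cluster whose leader
    (the cluster's first member) lies within `radius`; otherwise it
    starts a new cluster."""
    clusters = []
    for t in tokens:
        for c in clusters:
            if math.dist(emb[t], emb[c[0]]) <= radius:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

def hunt(emb, is_glitch, radius=1.0, shrink=0.5, rounds=2):
    """Iteratively narrow the candidate set: probe one token per cluster,
    keep only clusters whose probe looks glitchy, then re-cluster the
    survivors with a smaller radius. Assumption: glitch tokens sit close
    together, so a cluster's leader is representative of its members."""
    candidates = list(emb)
    for _ in range(rounds):
        kept = []
        for c in leader_clusters(candidates, emb, radius):
            if is_glitch(c[0]):  # probe only the leader
                kept.extend(c)
        candidates = kept
        radius *= shrink
    # Final pass: verify every surviving candidate individually.
    return [t for t in candidates if is_glitch(t)]

# Toy data (hypothetical): three ordinary groups and one tight glitch group.
EMB = {
    "the": (0.0, 0.0), "cat": (0.2, 0.1), "dog": (0.1, 0.3),
    "run": (3.0, 3.0), "walk": (3.1, 2.9),
    "g1": (9.0, 9.0), "g2": (9.1, 9.0), "g3": (9.0, 9.1),
}
GLITCH = {"g1", "g2", "g3"}  # stand-in for querying a real model

found = hunt(EMB, lambda t: t in GLITCH)
print(found)
```

Probing a single representative per cluster is what would save model queries on a real vocabulary, at the cost of missing glitch tokens that fall into a cluster with a normal leader; a practical detector would need a more careful sampling and clustering strategy than this sketch.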