Glitch tokens in Large Language Models (LLMs) can trigger unpredictable behaviors, compromising model reliability and safety. Existing detection methods often rely on manual observation to infer the prior distribution of glitch tokens, which is inefficient and lacks adaptability across diverse model architectures. To address these limitations, we introduce GlitchMiner, a gradient-based discrete optimization framework designed for efficient glitch token detection in LLMs. GlitchMiner leverages an entropy-based loss function to quantify the uncertainty in model predictions and integrates first-order Taylor approximation with a local search strategy to effectively explore the token space. Our evaluation across various mainstream LLM architectures demonstrates that GlitchMiner surpasses existing methods in both detection precision and adaptability. In comparison to the previous state-of-the-art, GlitchMiner achieves an average improvement of 19.07% in precision@1000 for glitch token detection. By enabling efficient detection of glitch tokens, GlitchMiner provides a valuable tool for assessing and mitigating potential vulnerabilities in LLMs, contributing to their overall security.
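To make the two core ideas concrete, below is a minimal sketch (not the authors' implementation) of how an entropy-based loss and a first-order Taylor approximation can be combined to score candidate tokens. All names (`entropy_loss`, `taylor_scores`, `token_pos`) and the use of the Hugging Face-style `inputs_embeds` / `.logits` interface are illustrative assumptions.

```python
# Hypothetical sketch of GlitchMiner's two building blocks:
# (1) an entropy loss over the model's next-token distribution, and
# (2) a first-order Taylor approximation that estimates how much swapping in
#     each vocabulary token would change that entropy, so only top-scored
#     candidates need an exact forward pass.
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the predicted next-token distribution
    (higher entropy = more uncertain prediction)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

def taylor_scores(model, input_embeds, token_pos, embedding_matrix):
    """Score every vocabulary token by the predicted entropy change from
    substituting it at `token_pos`, using a first-order Taylor expansion
    around the current embedding:  H(e') ≈ H(e) + (e' - e) · ∇_e H."""
    input_embeds = input_embeds.detach().requires_grad_(True)
    logits = model(inputs_embeds=input_embeds).logits[:, -1, :]
    loss = entropy_loss(logits)
    grad = torch.autograd.grad(loss, input_embeds)[0][0, token_pos]  # ∇_e H
    current = input_embeds[0, token_pos].detach()
    # score_v = (E_v - e) · ∇_e H  for every row E_v of the embedding matrix
    return (embedding_matrix - current) @ grad
```

In the full method, this approximation would be paired with a local search: scores are computed only over a neighborhood of the current token, and the highest-scoring candidates are re-evaluated with exact forward passes before being flagged as potential glitch tokens.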