Glitch tokens in Large Language Models (LLMs) can trigger unpredictable behaviors, threatening model reliability and safety. Existing detection methods often depend on predefined patterns, limiting their adaptability across diverse LLM architectures. We propose GlitchMiner, a gradient-based discrete optimization framework that efficiently identifies glitch tokens by leveraging entropy to quantify prediction uncertainty and a local search strategy to explore the token space. Experiments across multiple LLM architectures show that GlitchMiner outperforms existing methods in both detection accuracy and adaptability, achieving an average efficiency improvement of over 10%. GlitchMiner enhances vulnerability assessment in LLMs, contributing to more robust and reliable applications. Code is available at https://github.com/wooozihui/GlitchMiner.
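The entropy objective mentioned above can be sketched as follows. This is an illustrative, self-contained example (not the actual GlitchMiner implementation, which operates on LLM logits and combines the entropy signal with gradient-guided local search): it computes the Shannon entropy of a next-token distribution, where anomalously high uncertainty can flag candidate glitch tokens.

```python
import math

def next_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over `logits`.

    Illustrative sketch: GlitchMiner-style detection uses prediction
    entropy as an uncertainty score; tokens that induce unusually
    uncertain predictions are candidates for closer inspection.
    """
    # Numerically stable softmax over the vocabulary logits
    z_max = max(logits)
    exps = [math.exp(z - z_max) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Entropy: higher value = more uncertain prediction
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked (confident) distribution has low entropy;
# a flat (uncertain) distribution has high entropy.
confident = next_token_entropy([10.0, 0.0, 0.0, 0.0])
uncertain = next_token_entropy([0.0, 0.0, 0.0, 0.0])
```

For a uniform distribution over `n` outcomes the entropy reaches its maximum of `log(n)`, which is why a flat next-token distribution scores highest under this objective.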