Tokenization is a crucial step that bridges human-readable text and model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly more likely to elicit hallucinatory behaviors. Surprisingly, alternative tokenizations of the same phrases result in drastically lower rates of hallucination (a 93% reduction in Llama 3.1). We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may impede the development of trustworthy language models.
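As a minimal illustration of the failure mode described above (a sketch, not the paper's method): byte-level BPE operates on raw UTF-8 bytes, so a merge boundary can fall inside a multi-byte character, leaving a token whose bytes do not decode on their own. The split point below is a hypothetical example of such a boundary.

```python
# Sketch: how byte-level BPE can yield "incomplete" (undecodable) tokens.
# The character 'é' occupies two bytes in UTF-8; if a hypothetical BPE merge
# splits the sequence between those bytes, each side holds a stray byte.
text = "café"
data = text.encode("utf-8")          # b'caf\xc3\xa9'
token_a, token_b = data[:4], data[4:]  # hypothetical merge boundary inside 'é'

def decodable(b: bytes) -> bool:
    """Return True if the byte string is valid UTF-8 on its own."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodable(data))     # True: the full byte sequence decodes
print(decodable(token_a))  # False: dangling lead byte b'\xc3'
print(decodable(token_b))  # False: lone continuation byte b'\xa9'
```

Each fragment is meaningful only next to its counterpart, which is the adjacency dependence that improbable bigrams exploit.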