With simple fine-tuning, one can artificially embed hidden text into large language models (LLMs). The text is revealed only when the LLM is triggered by a specific query. Two primary applications are LLM fingerprinting and steganography. In LLM fingerprinting, a unique text identifier (fingerprint) is embedded within the model to verify licensing compliance. In steganography, the LLM serves as a carrier for hidden messages that can be disclosed through a designated trigger. Our work demonstrates that embedding hidden text in an LLM via fine-tuning, though seemingly secure due to the vast number of potential triggers (any sequence of characters or tokens could serve as one), is susceptible to extraction through analysis of the LLM's output decoding process. We propose a novel extraction approach called Unconditional Token Forcing. It is premised on the hypothesis that iteratively feeding each token from the LLM's vocabulary into the model reveals sequences with abnormally high token probabilities, indicating potential embedded-text candidates. Additionally, our experiments show that when the first token of a hidden fingerprint is used as input, the LLM not only produces an output sequence with high token probabilities but also repetitively generates the fingerprint itself. We also present a method of hiding text that is resistant to Unconditional Token Forcing, which we name Unconditional Token Forcing Confusion.
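The core extraction idea can be sketched with a toy stand-in for the LLM. Everything below (the tiny vocabulary, the hand-built next-token distribution, and the function names) is an illustrative assumption, not the paper's implementation; a real attack would query an actual fine-tuned model. The sketch tries every vocabulary token as the sole input, greedily decodes a few steps, and ranks continuations by average token probability, so a near-deterministic hidden sequence surfaces at the top.

```python
# Toy sketch of Unconditional Token Forcing. The "model" is a
# hand-built distribution so the example is self-contained; in it,
# the hidden fingerprint "secret fingerprint xyz" is near-deterministic,
# simulating the effect of fine-tuning, while all other tokens are uniform.
import math

VOCAB = ["<s>", "hello", "world", "secret", "fingerprint", "xyz", "the", "cat"]

# Hypothetical hidden chain learned during fine-tuning.
HIDDEN = {"secret": "fingerprint", "fingerprint": "xyz", "xyz": "secret"}

def next_token_probs(prev):
    """Toy next-token distribution given the previous token."""
    if prev in HIDDEN:
        probs = {t: 0.01 for t in VOCAB}
        probs[HIDDEN[prev]] = 1.0 - 0.01 * (len(VOCAB) - 1)
        return probs
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def force(token, steps=4):
    """Greedily decode from a single forced input token; return the
    generated sequence and its average log-probability."""
    seq, logp = [token], 0.0
    for _ in range(steps):
        probs = next_token_probs(seq[-1])
        best = max(probs, key=probs.get)
        logp += math.log(probs[best])
        seq.append(best)
    return seq, logp / steps

# Unconditional Token Forcing: feed each vocabulary token as the sole
# input and rank continuations by average token probability; abnormally
# high scores flag embedded-text candidates.
candidates = sorted((force(t) for t in VOCAB), key=lambda c: -c[1])
best_seq, best_lp = candidates[0]
print(best_seq, round(math.exp(best_lp), 3))
```

Note that the top-ranked continuation also exhibits the repetition phenomenon described above: decoding from the fingerprint's first token makes the toy model cycle through the fingerprint again.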