Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations, we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain (e.g., the French language), the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
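To give a feel for what "a statistical watermarking signal diffused throughout each response" means in practice, here is a minimal sketch of one common family of detection tests: a green-list scheme in which each token's predecessor pseudorandomly partitions the vocabulary, and detection counts how often emitted tokens land in the "green" subset, scoring the count with a z-test. All function names, the hashing scheme, and parameters below are illustrative assumptions, not the paper's actual construction.

```python
import hashlib
import math

def green_fraction(token_ids, gamma=0.5):
    # Illustrative detector: for each adjacent token pair, hash the
    # (previous, current) pair and check whether the current token falls
    # in the pseudorandom "green" subset of mass gamma induced by the
    # previous token. A watermarked model would be trained/steered to
    # prefer green tokens, pushing this fraction above gamma.
    hits = 0
    for prev, cur in zip(token_ids, token_ids[1:]):
        h = int(hashlib.sha256(f"{prev}:{cur}".encode()).hexdigest(), 16)
        if h / 2**256 < gamma:
            hits += 1
    return hits / (len(token_ids) - 1)

def z_score(green_count, n, gamma=0.5):
    # One-sided z-test: under the null (no watermark), green hits are
    # Binomial(n, gamma); a large z indicates the watermark is present.
    return (green_count - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

For example, observing 75 green tokens out of 100 scored positions at gamma = 0.5 gives z = 5.0, far beyond any reasonable false-positive threshold. Because the signal is an aggregate statistic over the whole response rather than a fixed memorized key, it cannot be filtered out by matching specific strings.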