Large Language Models (LLMs) are rapidly transforming the landscape of digital content creation. However, the prevalent black-box Application Programming Interface (API) access to many LLMs introduces significant challenges for accountability, governance, and security. LLM fingerprinting, which aims to identify the source model by analyzing statistical and stylistic features of generated text, offers a potential solution. Progress in this area is hindered by the lack of dedicated datasets and the need for efficient, practical methods that remain robust under adversarial manipulation. To address these challenges, we introduce FD-Dataset, a comprehensive bilingual fingerprinting benchmark comprising 90,000 text samples from 20 prominent proprietary and open-source LLMs. We further present FDLLM, a novel fingerprinting method that applies parameter-efficient Low-Rank Adaptation (LoRA) to fine-tune a foundation model, enabling it to extract deep, persistent features that characterize each source LLM. Our analysis shows that LoRA adaptation pulls outputs of the same LLM together in representation space while pushing outputs of different LLMs apart, which explains why LoRA is particularly effective for LLM fingerprinting. Extensive empirical evaluation on FD-Dataset demonstrates FDLLM's superiority: it achieves a Macro F1 score 22.1% higher than the strongest baseline. FDLLM also generalizes well to newly released models, reaching an average accuracy of 95% on unseen models. Notably, FDLLM remains consistently robust under various adversarial attacks, including polishing, translation, and synonym substitution, reducing the average attack success rate from 49.2% (LM-D) to 23.9%.
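To make the LoRA mechanism concrete, the following is a minimal NumPy sketch of the standard LoRA reparameterization that the method builds on: a frozen base weight plus a trainable low-rank update scaled by alpha/r. The dimensions, rank, and scaling here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the paper).
d_in, d_out, r, alpha = 16, 8, 4, 8

# Frozen pretrained weight: stands in for one projection of the foundation model.
W = rng.normal(size=(d_out, d_in))

# LoRA factors: only A and B would be trained during fine-tuning.
A = rng.normal(scale=0.01, size=(r, d_in))  # down-projection
B = np.zeros((d_out, r))                    # up-projection, zero-init => no change at start

def lora_forward(x):
    """y = x W^T + (alpha/r) * x A^T B^T, the LoRA reparameterization."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))

# With B zero-initialized, the adapted layer reproduces the frozen base exactly.
assert np.allclose(lora_forward(x), x @ W.T)

# After training perturbs B, the effective weight update B @ A has rank at most r,
# so the adaptation stays parameter-efficient.
delta = (alpha / r) * (B + 1.0) @ A
assert np.linalg.matrix_rank(delta) <= r
```

In a fingerprinting setup, such adapters would be attached to a foundation model's projection layers and trained on labeled LLM outputs, so that the low-rank updates encode the per-model stylistic features the abstract describes.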