Model fingerprinting is a widely used mechanism for protecting the intellectual property of open-source models, and it is non-intrusive: it requires no modification of the protected model. However, our analysis reveals that existing fingerprinting techniques are vulnerable to false claim attacks, in which an adversary fraudulently asserts ownership of an independent third-party model. We show that this vulnerability stems from the untargeted nature of current methods, which judge model similarity by comparing outputs on arbitrary samples rather than by measuring agreement with a specific, predefined reference. To address it, we introduce FIT-Print, a targeted fingerprinting paradigm designed to counter false claim attacks. Specifically, FIT-Print uses optimization to transform the fingerprint into a verifiable, targeted signature. Building on this paradigm, we propose two black-box fingerprinting methods, the bit-wise FIT-ModelDiff and the list-wise FIT-LIME, which use output distances and feature attributions, respectively, as model signatures. Extensive evaluations across benchmark models and datasets show that our framework defeats every false claim attack tested (100% defense success rate) and raises no false alarms on independent models (0.0%), while maintaining a 100% ownership verification rate against diverse model reuse techniques.
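To make the targeted idea concrete, the following is a minimal sketch of bit-wise verification against a predefined reference, not the authors' implementation. It assumes the fingerprint has already been extracted from the suspect model as a list of bits; the helper names (`target_signature`, `match_rate`, `verify_ownership`), the SHA-256 derivation of the target from an owner identifier, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
import hashlib
from typing import List


def target_signature(owner_id: str, n_bits: int = 64) -> List[int]:
    """Derive a fixed target bit string from an owner identifier.

    Hypothetical choice: the abstract only says the reference is
    "specific" and "predefined", not how it is produced.
    """
    if not 0 < n_bits <= 256:
        raise ValueError("n_bits must be in 1..256 (SHA-256 yields 256 bits)")
    digest = hashlib.sha256(owner_id.encode("utf-8")).digest()
    bits = [(byte >> i) & 1 for byte in digest for i in range(8)]
    return bits[:n_bits]


def match_rate(extracted: List[int], target: List[int]) -> float:
    """Fraction of positions where the extracted bits equal the target bits."""
    if len(extracted) != len(target):
        raise ValueError("signatures must have the same length")
    matches = sum(e == t for e, t in zip(extracted, target))
    return matches / len(target)


def verify_ownership(extracted: List[int], target: List[int],
                     threshold: float = 0.9) -> bool:
    """Accept the claim only if the extracted bits agree with the target.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return match_rate(extracted, target) >= threshold


if __name__ == "__main__":
    target = target_signature("owner@example.org")
    # Stand-in for bits extracted from a suspect model: the target with one bit flipped.
    suspect_bits = list(target)
    suspect_bits[0] ^= 1
    print(match_rate(suspect_bits, target), verify_ownership(suspect_bits, target))
```

The point of the sketch is the comparison target: an untargeted check asks whether two models behave alike on some samples, whereas here a claim succeeds only if the extracted bits reproduce a reference fixed in advance, which an unrelated model is unlikely to match by chance.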