Model merging has gained significant attention as a cost-effective approach to integrating multiple single-task fine-tuned models into a unified one that performs well on multiple tasks. However, existing model merging techniques primarily focus on resolving conflicts between task-specific models; they often overlook potential security threats, particularly the risk of backdoor attacks in the open-source model ecosystem. In this paper, we first investigate the vulnerabilities of existing model merging methods to backdoor attacks, identifying two critical challenges: backdoor succession and backdoor transfer. To address these issues, we propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Specifically, DAM employs a meta-learning-based optimization method with dual masks to identify a shared and safety-aware subspace for model merging. These masks are alternately optimized: the Task-Shared mask identifies common beneficial parameters across tasks, aiming to preserve task-specific knowledge while reducing interference, while the Backdoor-Detection mask isolates potentially harmful parameters to neutralize security threats. This dual-mask design allows us to carefully balance the preservation of useful knowledge against the removal of potential vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points while sacrificing only about 1% in accuracy. Furthermore, DAM exhibits robust performance and broad applicability across various types of backdoor attacks and varying numbers of compromised models involved in the merging process. Our code and models are available at https://github.com/Yangjinluan/DAM.
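The dual-mask idea described above can be illustrated with a minimal sketch. This is a hypothetical simplification assuming a task-arithmetic-style merge of parameter deltas, not the paper's actual meta-learning procedure; the function and mask representations here are illustrative assumptions.

```python
import numpy as np

def defense_aware_merge(base, task_vectors, shared_masks, backdoor_masks, alpha=1.0):
    """Hypothetical sketch of dual-mask merging (simplified, not DAM's real algorithm).

    base           : flat parameter vector of the pretrained model
    task_vectors   : list of (fine-tuned - base) parameter deltas, one per task
    shared_masks   : binary masks keeping parameters judged beneficial across tasks
    backdoor_masks : binary masks flagging potentially harmful parameters (1 = drop)
    """
    merged = base.copy()
    for tv, sm, bm in zip(task_vectors, shared_masks, backdoor_masks):
        # Keep only task-shared parameters and zero out suspected backdoor ones.
        merged += alpha * tv * sm * (1 - bm)
    return merged

# Toy example: two 4-parameter "models" merged with both masks applied.
base = np.zeros(4)
tvs = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([4.0, 3.0, 2.0, 1.0])]
shared = [np.array([1, 1, 0, 1]), np.array([1, 0, 1, 1])]
backdoor = [np.array([0, 0, 0, 1]), np.array([0, 0, 0, 1])]
print(defense_aware_merge(base, tvs, shared, backdoor, alpha=0.5))
```

In DAM itself the two masks are not fixed as above but alternately optimized via meta-learning; the sketch only shows how, once learned, they jointly restrict the merge to a shared, safety-aware subspace.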