The potential for exploitation of AI models has grown with the rapid advancement of Artificial Intelligence (AI) and the widespread use of model-sharing platforms such as Model Zoo. Attackers can embed malware within AI models through steganographic techniques, exploiting the substantial size of these models to conceal malicious data for nefarious purposes such as remote code execution. Securing AI models is a burgeoning area of research, essential for safeguarding the many organizations and users that rely on AI technologies. This study leverages well-studied few-shot learning techniques from the image domain by transferring AI models into that domain using a novel image representation. Applying few-shot learning in this setting lets us build practical detectors, a capability that prior work lacks. Our method addresses critical limitations of state-of-the-art detection techniques that hinder their practicality, reducing the required training dataset from 40,000 models to just six. Furthermore, our models consistently detect subtle attacks with embedding rates as low as 25%, and even 6% in some cases, whereas previous works were shown to be effective only at embedding rates of 50%-100%. We employ a strict evaluation strategy to ensure the trained models generalize across various factors. In addition, our trained models successfully detect novel spread-spectrum steganography attacks despite being trained on only one attack type, demonstrating impressive robustness. We open-source our code to support reproducibility and foster research in this new field.
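To make the two core ideas concrete, the sketch below illustrates (a) a toy LSB-style steganographic embedding into float32 model weights, where the number of low bits used determines the embedding rate, and (b) one plausible way to render weight bytes as a grayscale image for image-domain classifiers. This is a minimal, hypothetical sketch, not the paper's actual implementation; all function names, the byte layout, and the square padding are illustrative assumptions.

```python
# Illustrative sketch only: a toy LSB embedding and a byte-level grayscale
# rendering of model weights. Not the paper's actual method.
import numpy as np


def lsb_embed(weights: np.ndarray, payload: bytes, n_bits: int = 1) -> np.ndarray:
    """Hide payload bits in the n_bits lowest bits of each float32 weight.

    n_bits controls the embedding rate (n_bits / 32 of each weight's bits).
    """
    ints = weights.astype(np.float32).view(np.uint32).copy()
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    pad = (-len(bits)) % n_bits                      # pad to a multiple of n_bits
    bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
    assert len(bits) // n_bits <= ints.size, "payload too large for carrier"
    mask = np.uint32((1 << n_bits) - 1)
    for i in range(0, len(bits), n_bits):
        value = int("".join(map(str, bits[i : i + n_bits])), 2)
        idx = i // n_bits
        ints[idx] = (ints[idx] & ~mask) | np.uint32(value)  # overwrite low bits
    return ints.view(np.float32)


def weights_to_image(weights: np.ndarray) -> np.ndarray:
    """Reinterpret raw float32 weight bytes as a square 8-bit grayscale image."""
    pixels = np.frombuffer(weights.astype(np.float32).tobytes(), dtype=np.uint8)
    side = int(np.ceil(np.sqrt(pixels.size)))        # smallest square that fits
    padded = np.zeros(side * side, dtype=np.uint8)   # zero-pad the tail
    padded[: pixels.size] = pixels
    return padded.reshape(side, side)


if __name__ == "__main__":
    w = np.random.randn(4096).astype(np.float32)     # stand-in for a weight tensor
    stego = lsb_embed(w, b"malicious payload", n_bits=4)
    img = weights_to_image(stego)
    print(img.shape)                                 # (128, 128) for this example
```

Under this kind of setup, low-bit embeddings perturb the image only faintly, which is what makes low embedding rates hard to detect and motivates a learned image-domain detector.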