Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
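As a rough illustration of how a bounded, nearly imperceptible perturbation can steer a model toward a chosen output, the sketch below runs signed-gradient descent with projection onto an L-infinity ball. Everything in it is an assumption made for illustration: the abstract does not state the optimizer, the perturbation budget, or the loss, and the linear "model", the names `refusal_loss`, `loss_gradient`, and `protect`, and the defaults `eps=8/255` and `step=1/255` are hypothetical stand-ins rather than ImageProtector's actual procedure. A real MLLM would replace the toy linear scorer, with the target class standing in for the refusal response.

```python
import math

# Toy stand-in for an MLLM: a linear scorer over flattened pixels.
# (Assumption for illustration only; not the model used in the paper.)

def refusal_loss(pixels, weights, target):
    """Cross-entropy of the toy model toward the target (refusal) class."""
    logits = [sum(w * x for w, x in zip(row, pixels)) for row in weights]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def loss_gradient(pixels, weights, target):
    """Gradient of refusal_loss with respect to each pixel."""
    logits = [sum(w * x for w, x in zip(row, pixels)) for row in weights]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    # d(loss)/d(logit_j) = softmax_j - 1[j == target]
    coeffs = [e / total - (1.0 if j == target else 0.0)
              for j, e in enumerate(exps)]
    return [sum(c * row[i] for c, row in zip(coeffs, weights))
            for i in range(len(pixels))]

def protect(pixels, weights, target, eps=8 / 255, step=1 / 255, iters=40):
    """Signed-gradient descent, projected onto the L-infinity ball of radius eps."""
    protected = list(pixels)
    for _ in range(iters):
        grad = loss_gradient(protected, weights, target)
        for i, g in enumerate(grad):
            # Move each pixel one step against the sign of its gradient.
            moved = protected[i] - step * ((g > 0) - (g < 0))
            # Project: stay within eps of the original pixel and inside [0, 1].
            moved = min(max(moved, pixels[i] - eps), pixels[i] + eps)
            protected[i] = min(max(moved, 0.0), 1.0)
    return protected
```

Calling `protect(pixels, weights, target)` on pixel values in [0, 1] returns a copy in which no pixel differs from the original by more than `eps`. The budget `eps` is what keeps the change visually small in this sketch; how the paper enforces imperceptibility is not stated in the abstract.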