The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potentially malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves an average attack success rate of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
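The subject/motion decomposition lends itself to a simple two-stage orchestration layer. Below is a minimal Python sketch of how such a pipeline could be wired, assuming external text-to-image and image-to-video generators behind stub functions; every name here (`decompose_query`, `text_to_image`, `image_to_video`) is illustrative and not the paper's actual API.

```python
from dataclasses import dataclass


@dataclass
class VideoSpec:
    subject_prompt: str  # "what is shown": conditions a text-to-image model
    motion_text: str     # "how it moves": conditions an image-to-video model


def decompose_query(query: str) -> VideoSpec:
    """Stand-in for the LLM step that splits a query's video semantics
    into a static subject description and a motion description."""
    # In practice an LLM would produce both fields; a canned split is shown here.
    return VideoSpec(
        subject_prompt=f"a photorealistic scene depicting: {query}",
        motion_text="the camera slowly pans across the scene",
    )


def text_to_image(prompt: str) -> bytes:
    """Stub for an external text-to-image model (stage 1: subject image)."""
    return b"<image bytes>"


def image_to_video(image: bytes, motion_text: str) -> bytes:
    """Stub for an external image-to-video model (stage 2: animate the subject)."""
    return b"<video bytes>"


def synthesize_video(query: str) -> bytes:
    """Compose the two stages: subject image and motion text jointly guide synthesis."""
    spec = decompose_query(query)
    image = text_to_image(spec.subject_prompt)
    return image_to_video(image, spec.motion_text)
```

Separating the static subject from the motion description is what makes the synthesis controllable: each query-relevant video is pinned to an explicit "what" and "how" rather than a single free-form prompt.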
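To make the idea of a confidence-aware judge score concrete, here is a minimal Python sketch under stated assumptions: the judge model rates harmfulness on a 1 to 5 scale, exposes log-probabilities for the rating token, and the score is the probability-weighted expectation over ratings, compared against a threshold calibrated to human judgments. The function names and the threshold value are illustrative; the paper's exact RJScore formulation may differ.

```python
import math

RATING_TOKENS = {"1", "2", "3", "4", "5"}


def rjscore(rating_logprobs: dict[str, float]) -> float:
    """Expected rating under the judge's token distribution,
    renormalized over the five rating tokens."""
    probs = {int(t): math.exp(lp)
             for t, lp in rating_logprobs.items() if t in RATING_TOKENS}
    total = sum(probs.values())
    return sum(rating * p / total for rating, p in probs.items())


# Hypothetical value; the paper calibrates the threshold against human labels.
DECISION_THRESHOLD = 3.0


def judged_unsafe(rating_logprobs: dict[str, float]) -> bool:
    """Attack counts as successful when the confidence-weighted score
    crosses the calibrated threshold."""
    return rjscore(rating_logprobs) >= DECISION_THRESHOLD


# Example: a judge split between "3" and "4" yields a borderline score (~3.35)
# that a hard argmax over rating tokens would flatten to exactly 3.
print(rjscore({"3": math.log(0.55), "4": math.log(0.40), "2": math.log(0.05)}))
```

Weighting by the judge's token probabilities, rather than taking its single most likely rating, is what lets a metric of this kind grade uncertain or borderline harmful outputs smoothly instead of forcing a hard verdict.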