Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes; 28 attributes, including affective states such as pain, fear, aggression, and distress; and 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M-parameter student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, a Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss. This recipe yields a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all commercial safety APIs and all evaluated VLMs except the Gemini models, while achieving the highest object detection and captioning scores across all models, with $7.6\times$ faster inference and $16\times$ less GPU memory.