Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation, because a video can be inappropriate for many reasons beyond violence and obscenity. For example, scammers may create junk content that resembles popular educational videos but carries no meaningful information. This paper evaluates the performance of several CLIP variants for content moderation of children's cartoons in both the supervised and zero-shot settings. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work on the Malicious or Benign (MOB) benchmark for video content moderation. We also present an in-depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos, as they are not well represented in the CLIP training data.