The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM)-based framework for hate video localization. Unlike state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content without any training. Our method decomposes a video into five modalities (image, speech, OCR, music, and video context) and uses a multi-stage prompting scheme to compute a fine-grained hatefulness score for each frame. We further introduce a composition-matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, show that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
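The pipeline described above (per-modality captioning, frame-level scoring, and temporal localization) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the modality names follow the abstract, but `score_modality` stands in for the LLM prompting stage (replaced here by a keyword stub), and the fusion rule and threshold are assumptions.

```python
# Hypothetical sketch of a LELA-style pipeline. All function names,
# the mean-fusion rule, and the threshold are illustrative assumptions.
from statistics import mean

# The five modalities named in the abstract.
MODALITIES = ["image", "speech", "ocr", "music", "video_context"]

def score_modality(caption: str) -> float:
    """Rate a caption's hatefulness in [0, 1].
    A real system would prompt an LLM here; this keyword stub is a placeholder."""
    return 1.0 if "slur" in caption.lower() else 0.0

def frame_score(captions: dict) -> float:
    """Fuse per-modality scores into a single frame-level hatefulness score."""
    return mean(score_modality(captions.get(m, "")) for m in MODALITIES)

def localize(frames: list, threshold: float = 0.1) -> list:
    """Return (start, end) frame-index spans whose fused score exceeds the threshold."""
    spans, start = [], None
    for i, captions in enumerate(frames):
        hot = frame_score(captions) > threshold
        if hot and start is None:
            start = i                     # a hateful segment begins
        elif not hot and start is not None:
            spans.append((start, i - 1))  # the segment just ended
            start = None
    if start is not None:
        spans.append((start, len(frames) - 1))
    return spans

# Toy example: four frames with per-modality captions.
frames = [
    {"image": "a crowd", "speech": "hello"},
    {"image": "a crowd", "speech": "a slur is shouted"},
    {"image": "a crowd", "speech": "the slur is repeated"},
    {"image": "empty street", "speech": "goodbye"},
]
print(localize(frames))  # → [(1, 2)]
```

The key design point this sketch mirrors is that localization falls out of thresholding a fused per-frame score, so no supervised temporal model is needed; the quality of the LLM scoring stage determines everything.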