TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy in which vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over the state of the art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.
