Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.