The growing importance of multi-modal humor detection within affective computing parallels the expanding influence of short-form video sharing on social media platforms. In this paper, we propose a novel two-branch hierarchical model for short-form video humor detection (SVHD), named Comment-aided Video-Language Alignment (CVLA), built on data-augmented multi-modal contrastive pre-training. Notably, our CVLA not only operates on raw signals across various modal channels but also yields an appropriate multi-modal representation by aligning the video and language components within a consistent semantic space. Experimental results on two humor detection datasets, DY11k and UR-FUNNY, demonstrate that CVLA dramatically outperforms state-of-the-art and several competitive baseline approaches. Our dataset, code, and model are released at https://github.com/yliu-cs/CVLA.
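To make the alignment idea concrete, below is a minimal sketch (not the authors' implementation) of the kind of symmetric contrastive objective commonly used to pull paired video and language embeddings together in a shared semantic space, as the abstract describes for CVLA's multi-modal contrastive pre-training. All names and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (batch, dim) projections from the two branches.
    Matching pairs share the same batch index; all other pairs act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (video-to-text and text-to-video).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    v = torch.randn(8, 256)   # video-branch embeddings (illustrative)
    t = torch.randn(8, 256)   # language-branch embeddings (illustrative)
    print(contrastive_alignment_loss(v, t).item())
```

After such pre-training, the aligned multi-modal representation can be fed to a downstream classifier for the humor/non-humor decision; the exact branch architectures and comment-aided augmentation are detailed in the paper and the released code.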