Automatically understanding funny moments (i.e., the moments that make people laugh) in comedy is challenging, as they depend on many factors, such as body language, dialogue and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention over visual, audio and text data to predict funny moments in videos. Unlike most methods, which rely on ground truth data in the form of subtitles, we exploit modalities that come naturally with videos: (a) video frames, as they contain visual information indispensable for scene understanding; (b) audio, as it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses; and (c) text automatically extracted with a speech-to-text model, as it can provide rich information when processed by a Large Language Model. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD and Friends, and the TED talk dataset UR-Funny. Extensive experiments and analysis show that FunnyNet-W successfully exploits visual, auditory and textual cues to identify funny moments, while our findings reveal FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets a new state of the art for funny moment detection with multimodal cues on all datasets, both with and without ground truth information.