Automatically extracting engaging, high-quality humorous scenes from cinematic titles is pivotal for creating captivating video previews and snackable content, which in turn boosts user engagement on streaming platforms. Long-form cinematic titles, with their extended duration and complex narratives, make scene localization difficult, while humor's reliance on multiple modalities and its nuanced style add further complexity. This paper introduces an end-to-end system for automatically identifying and ranking humorous scenes in long-form cinematic titles, comprising shot detection, multimodal scene localization, and humor tagging optimized for cinematic content. Key innovations include a novel scene segmentation approach combining visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework leveraging both audio and text. Our system achieves an 18.3% AP improvement over state-of-the-art scene detection on the OVSD dataset and an F1 score of 0.834 for detecting humor in long text. Extensive evaluations across five cinematic titles show that 87% of the clips extracted by our pipeline are intended to be funny and that 98% of scenes are accurately localized. The pipeline also generalizes successfully to trailers, and together these results showcase its potential to enhance content creation workflows, improve user engagement, and streamline snackable content generation for diverse cinematic media formats.
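The abstract mentions shot representations learned via guided triplet mining but does not spell out the procedure. As a rough illustration only, the sketch below shows a standard triplet margin loss together with one possible mining rule, under the assumption that scene labels serve as the guidance: for each shot, the farthest shot from the same scene is taken as the positive and the nearest shot from a different scene as the negative. The function names (`euclidean`, `triplet_loss`, `mine_triplets`) and the mining rule are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def mine_triplets(embeddings, scene_ids):
    """Hypothetical mining rule (an assumption, not the paper's method).

    For each shot i, pick the farthest same-scene shot as the positive and
    the nearest different-scene shot as the negative. Shots that lack either
    a same-scene or a different-scene partner are skipped.
    Returns a list of (anchor_idx, positive_idx, negative_idx) tuples.
    """
    triplets = []
    for i, (emb, scene) in enumerate(zip(embeddings, scene_ids)):
        same = [j for j, s in enumerate(scene_ids) if s == scene and j != i]
        other = [j for j, s in enumerate(scene_ids) if s != scene]
        if not same or not other:
            continue
        pos = max(same, key=lambda j: euclidean(emb, embeddings[j]))
        neg = min(other, key=lambda j: euclidean(emb, embeddings[j]))
        triplets.append((i, pos, neg))
    return triplets


# Toy usage: four 2-D shot embeddings belonging to two scenes.
shots = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
scenes = [0, 0, 1, 1]
for a, p, n in mine_triplets(shots, scenes):
    print(a, p, n, triplet_loss(shots[a], shots[p], shots[n], margin=3.0))
```

In a real training loop the embeddings would come from a shot encoder and the loss would be backpropagated through it; this pure-Python version only makes the selection rule and the loss term concrete.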