Trailer generation is a challenging video clipping task that aims to select highlight shots from long videos such as movies and reorganize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate trailer generation as selecting and sorting key movie shots based on audio shots, which requires matching latent representations across the visual and acoustic modalities. To this end, we learn a multi-modal latent representation model within the proposed IPOT framework. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distances between the shots' latent representations, together with the distribution over the movie shots. Treating the correspondence between a movie's shots and its trailer's music shots as the observed optimal transport plan defined on these grounding distances, we learn the model by solving an inverse partial optimal transport problem, which leads to a bi-level optimization strategy. We collect real-world movies and their trailers to construct CMTD, a dataset with abundant label information, and use it to train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in both subjective visual effects and objective quantitative measurements.
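To make the matching step concrete, the sketch below shows a standard entropy-regularized Sinkhorn solver, which is the classical building block behind Sinkhorn-style matching networks. This is a generic illustration, not the paper's attention-assisted model: the cost matrix here stands in for the learned grounding distances between movie-shot and music-shot embeddings, and the function name and parameters are our own illustrative choices.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between histograms a and b
    given a ground-cost matrix, via Sinkhorn scaling iterations."""
    K = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale columns to match marginal b
        u = a / (K @ v)                  # rescale rows to match marginal a
    # Transport plan P = diag(u) K diag(v); P[i, j] couples shot i to shot j
    return u[:, None] * K * v[None, :]

# Toy usage: 5 movie shots coupled to 4 music shots, uniform weights.
rng = np.random.default_rng(0)
cost = rng.random((5, 4))                # stand-in for learned grounding distances
a = np.full(5, 1 / 5)
b = np.full(4, 1 / 4)
P = sinkhorn(cost, a, b)
```

In the inverse problem described above, one would instead observe the plan (the movie-to-trailer correspondence) and optimize the encoder producing `cost`, giving the bi-level structure: an inner Sinkhorn solve nested inside an outer representation-learning loop.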