With the widespread popularity of user-generated short videos, it is becoming increasingly challenging for content creators to promote their content to potential viewers. Automatically generating appealing titles and covers for short videos can help grab viewers' attention. Existing studies on video captioning mostly focus on generating factual descriptions of actions, which differ from video titles intended to catch viewers' attention. Furthermore, research on cover selection based on multimodal information is sparse. These problems motivate the need for tailored methods that support the joint task of short video title generation and cover selection (TG-CS), as well as for corresponding datasets to support such studies. In this paper, we first collect and present a real-world dataset named Short Video Title Generation (SVTG) that contains videos with appealing titles and covers. We then propose a Title generation and Cover selection with attention Refinement (TCR) method for TG-CS. The refinement procedure progressively selects high-quality samples and, within each sample, the highly relevant frames and text tokens to refine model training. Extensive experiments show that our TCR method outperforms various existing video captioning methods in generating titles and selects better covers for noisy real-world short videos.