Diverse video captioning aims to generate a set of sentences that describe a given video from various aspects. Mainstream methods are trained on independent pairs of a video and a caption from its ground-truth set, without exploiting intra-set relationships, which results in low diversity among the generated captions. In contrast, we formulate diverse captioning as a semantic-concept-guided set prediction (SCG-SP) problem: the predicted caption set is fitted to the ground-truth set, so that set-level relationships are fully captured. Specifically, our set prediction consists of two synergistic tasks: caption generation and an auxiliary task, concept combination prediction, which provides extra semantic supervision. Each caption in the set is attached to a concept combination that indicates the primary semantic content of the caption and facilitates element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. The two tasks share multiple semantics-specific encodings as input, which are obtained through iterative interaction between visual features and conceptual queries. The correspondence between generated captions and specific concept combinations further supports the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.
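To make the idea of element alignment in set prediction more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each caption's concept combination can be represented as a plain set of concept words, that the alignment cost between a predicted and a ground-truth combination is their Jaccard distance, and that the best one-to-one alignment is found by brute-force search over permutations. The abstract specifies none of these choices. The function names `jaccard_distance`, `align_sets`, and `mean_pairwise_distance` are hypothetical, and the last one is only a simple stand-in for a diversity measure over concept combinations, not the paper's regularization term.

```python
from itertools import permutations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def align_sets(predicted, ground_truth):
    """Find the one-to-one alignment with the lowest total Jaccard distance.

    Returns (assignment, cost), where assignment[i] is the index of the
    ground-truth concept set matched to predicted[i]. Brute force is an
    assumption made for clarity and is only practical for small sets.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("both sets must contain the same number of elements")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(ground_truth))):
        cost = sum(
            jaccard_distance(predicted[i], ground_truth[j])
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


def mean_pairwise_distance(concept_sets):
    """Average Jaccard distance over all unordered pairs (illustrative diversity score)."""
    n = len(concept_sets)
    if n < 2:
        return 0.0
    total = sum(
        jaccard_distance(concept_sets[i], concept_sets[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


# Toy data, invented purely for illustration.
predicted = [{"man", "guitar"}, {"dog", "run"}, {"crowd", "stage"}]
ground_truth = [{"dog", "run", "park"}, {"crowd", "stage"}, {"man", "guitar", "sing"}]

assignment, cost = align_sets(predicted, ground_truth)
print(assignment, cost)
print(mean_pairwise_distance(predicted))
```

The brute-force search examines every permutation, so it is only suitable for very small sets. For larger sets, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. Whether SCG-SP uses such a solver is not stated in the abstract.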