Video captioning models are easily affected by the long-tail distribution of phrases, which makes them prone to generating vague sentences instead of accurate ones. However, existing debiasing strategies tend to draw on external knowledge to build dependency trees of words, or to refine the frequency distribution through complex losses and extra input features; these approaches lack interpretability and are hard to train. To mitigate the impact of granularity bias on the model, we introduce a statistics-based bias extractor. This extractor quantifies the information content of sentences and videos, and estimates the likelihood that a video-sentence pair is affected by granularity bias. Furthermore, following the growing trend of integrating contrastive learning into video captioning, we use a bidirectional triplet loss to obtain more negative samples within a batch. We then incorporate the margin score into the contrastive learning loss, establishing distinct training objectives for head and tail sentences, which improves training effectiveness on tail samples. Our simple yet effective loss, which incorporates granularity bias, is referred to as the Margin-Contrastive Loss (GMC Loss). The proposed model achieves state-of-the-art performance on MSRVTT, with a CIDEr of 57.17, and on MSVD, with a CIDEr of up to 138.68.
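As a rough illustration of the loss described above, the following is a minimal sketch of a bidirectional triplet (hinge) loss in which each matched video-sentence pair carries its own margin. It is not the authors' implementation: the function name, the `sim` similarity matrix, and the `margins` list are all hypothetical, and how the bias extractor turns its estimate into a margin score is not specified in the abstract, so the margins are simply taken as input here.

```python
def bidirectional_margin_triplet_loss(sim, margins):
    """Hinge-based bidirectional triplet loss with a per-pair margin.

    sim[i][j]  : similarity between video i and sentence j; matched pairs
                 are assumed to lie on the diagonal (hypothetical layout).
    margins[i] : margin for matched pair i, assumed to be derived from
                 some granularity-bias score (not specified by the source).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]  # similarity of the matched pair
        for j in range(n):
            if j == i:
                continue
            # video -> sentence: video i against non-matching sentence j
            total += max(0.0, margins[i] + sim[i][j] - pos)
            # sentence -> video: sentence i against non-matching video j
            total += max(0.0, margins[i] + sim[j][i] - pos)
    return total / n


# Toy batch of two video-sentence pairs.
sim = [[0.9, 0.2],
       [0.4, 0.8]]
uniform_loss = bidirectional_margin_triplet_loss(sim, [0.2, 0.2])
biased_loss = bidirectional_margin_triplet_loss(sim, [0.2, 0.5])
```

Because every other item in the batch serves as a negative in both directions, each matched pair is compared against 2(n - 1) negatives. Giving a pair a larger margin demands a wider separation from its negatives, which is one way to set different training objectives for different samples; whether head or tail sentences should receive the larger margin is a design choice the abstract does not state.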