This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple answers. However, due to annotation costs, the labels in existing benchmarks are usually severely insufficient, typically with only one answer per question. As a result, existing works tend to treat all unlabeled answers directly as negative labels, which limits generalization. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers. These rankings contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach, which introduces adaptive soft margins to dynamically refine the optimization constraints on different pairwise rankings, and a listwise approach, which adopts sampling-based partial listwise learning to resist the bias in the teacher ranking. Extensive experiments on five popular benchmarks consistently show that both the pairwise and listwise variants of RADI outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
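To make the pairwise idea concrete, the following is a minimal sketch of a hinge-style ranking loss in which each pair's margin is taken from the teacher's own score gap. The abstract does not give the actual formulation, so everything here is an illustrative assumption rather than the RADI method itself: the function name `pairwise_soft_margin_loss`, the `top_k` cutoff, the use of softmax probabilities, and the choice of margin `t[i] - t[j]` are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def pairwise_soft_margin_loss(student_logits, teacher_logits, top_k=3):
    """Hinge loss over teacher-ranked answer pairs.

    ASSUMPTION: the margin of each pair is the teacher's probability gap,
    so pairs the teacher barely separates impose only a weak constraint.
    This is an illustrative choice, not the formulation from the paper.
    """
    s = softmax(np.asarray(student_logits, dtype=float))
    t = softmax(np.asarray(teacher_logits, dtype=float))

    # Indices of the teacher's top-k answers, best first.
    order = np.argsort(-t)[:top_k]

    losses = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            i, j = order[a], order[b]      # teacher ranks i above j
            margin = t[i] - t[j]           # adaptive, parameter-free margin
            losses.append(max(0.0, margin - (s[i] - s[j])))

    # No pairs (e.g. top_k < 2) means there is nothing to penalize.
    return float(np.mean(losses)) if losses else 0.0
```

Under these assumptions, a student that reproduces the teacher's scores incurs zero loss, while a student that orders the teacher's top answers differently is penalized in proportion to how confidently the teacher separated them. Using the teacher's gap as the margin keeps the loss free of extra hyperparameters, which mirrors the "parameter-free" property claimed in the abstract without claiming to match its exact form.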