Video question answering (VideoQA) is an essential task in vision-language understanding and has attracted considerable research attention recently. Nevertheless, existing works mostly achieve promising performance only on short videos lasting up to 15 seconds. On minute-level long-term videos, those methods are likely to fail because they lack the ability to deal with the noise and redundancy caused by scene changes and multiple actions in the video. Since a question typically concerns only a short temporal range, we propose to first localize the question to a segment of the video and then infer the answer using the located segment only. Under this scheme, we propose "Locate before Answering" (LocAns), a novel approach that integrates a question locator and an answer predictor into an end-to-end model. During training, the available answer label not only serves as the supervision signal for the answer predictor but is also used to generate pseudo temporal labels for the question locator. Moreover, we design a decoupled alternating training strategy that updates the two modules separately. In experiments, LocAns achieves state-of-the-art performance on two modern long-term VideoQA datasets, NExT-QA and ActivityNet-QA, and qualitative examples show that its question localization is reliable.
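The abstract above does not say how the pseudo temporal labels are produced or how the two modules take turns during training. The following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that candidate segments are fixed-length sliding windows over the video clips, that the pseudo label is the segment under which the answer predictor assigns the highest probability to the ground-truth answer, and that the locator and predictor are updated on alternating steps. All names (`candidate_segments`, `pseudo_temporal_label`, `alternating_schedule`, `answer_prob`) are hypothetical.

```python
from typing import Callable, Iterator, List, Mapping, Sequence, Tuple

Segment = Tuple[int, int]  # [start, end) clip indices; an assumed representation


def candidate_segments(num_clips: int, window: int, stride: int) -> List[Segment]:
    """Enumerate sliding-window segments that together cover the whole video."""
    if num_clips <= window:
        return [(0, num_clips)]
    starts = list(range(0, num_clips - window + 1, stride))
    # Add a final window so the tail of the video is always covered.
    if starts[-1] != num_clips - window:
        starts.append(num_clips - window)
    return [(s, s + window) for s in starts]


def pseudo_temporal_label(
    segments: Sequence[Segment],
    answer_prob: Callable[[Segment], Mapping[str, float]],
    true_answer: str,
) -> int:
    """Return the index of the segment that best supports the true answer.

    Assumption: `answer_prob(segment)` is the answer predictor's distribution
    over answers when it sees only that segment.
    """
    scores = [answer_prob(seg)[true_answer] for seg in segments]
    return max(range(len(segments)), key=scores.__getitem__)


def alternating_schedule(num_steps: int, period: int = 1) -> Iterator[str]:
    """Yield which module is updated at each step (the other stays frozen)."""
    for step in range(num_steps):
        yield "locator" if (step // period) % 2 else "predictor"


if __name__ == "__main__":
    segs = candidate_segments(num_clips=10, window=4, stride=3)
    # Toy predictor outputs, invented purely for illustration.
    toy = {
        segs[0]: {"A": 0.7, "B": 0.3},
        segs[1]: {"A": 0.2, "B": 0.8},
        segs[2]: {"A": 0.5, "B": 0.5},
    }
    label = pseudo_temporal_label(segs, lambda seg: toy[seg], true_answer="B")
    print("pseudo label segment:", segs[label])
    print("update order:", list(alternating_schedule(4)))
```

In a real system the toy dictionary would be replaced by the answer predictor's forward pass, and each name yielded by the schedule would select which module's parameters the optimizer updates at that step.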