While several long-form VideoQA datasets have been introduced, neither the length of the videos used to curate questions nor the length of the clue sub-clips needed to answer them has reached the level required for genuine long-form video understanding. Moreover, their QAs are unduly narrow and modality-biased, lacking a wider view of long-term video content with rich dynamics and complex narratives. To remedy this, we introduce MoVQA, a long-form movie question-answering dataset and benchmark that assesses the diverse cognitive capabilities of multimodal systems across multi-level temporal lengths, considering both video length and clue length. Additionally, to take a step towards human-level understanding of long-form video, versatile and multimodal question-answering is designed from the moviegoer's perspective to assess model capabilities along various perceptual and cognitive axes. Analysis of various baselines reveals a consistent trend: the performance of all methods deteriorates significantly with increasing video and clue length. Meanwhile, our established baseline method shows some improvement, but there is still ample scope for enhancement on our challenging MoVQA dataset. We expect MoVQA to provide a new perspective and inspire further work on long-form video understanding research.