Research on vision and language has shown that both visual and textual content are crucial for understanding scenes effectively. Understanding text in videos is particularly important, as it requires both scene-text understanding and temporal reasoning. This paper examines two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which target video question answering based on textual content. NewsVideoQA contains question-answer pairs about the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories such as vlogging, traveling, and shopping. We analyze the formulation of these datasets at several levels, examining the degree of visual understanding and multi-frame comprehension required to answer the questions. We also experiment with BERT-QA, a text-only model, which achieves performance comparable to the original methods on both datasets, indicating shortcomings in how these datasets are formulated. Finally, we study domain adaptation by training on M4-ViteVQA and evaluating on NewsVideoQA, and vice versa, shedding light on the challenges and potential benefits of out-of-domain training.