To date, the majority of video retrieval systems have been optimized for a "single-shot" scenario in which the user submits a query in isolation, ignoring previous interactions with the system. Recently, there has been renewed interest in interactive systems to enhance retrieval, but existing approaches are complex and deliver limited gains in performance. In this work, we revisit this topic and propose several simple yet effective baselines for interactive video retrieval via question-answering. We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task without access to ground-truth dialogue data. Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
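The interaction loop described above (initial text query, a question from the system, an answer from a simulated user, then re-ranking) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the method of this work: the three toy "videos", the bag-of-words `embed` and `cosine` scoring, the `ANSWERS` lookup that stands in for a VideoQA model, and the choice to fold each answer into the query by simple concatenation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each "video" is reduced to a caption-like string.
# These are illustrative stand-ins, not data from MSR-VTT, MSVD, or AVSD.
VIDEOS = {
    "v1": "a man plays guitar on a stage",
    "v2": "a man plays guitar in a kitchen",
    "v3": "a woman cooks pasta in a kitchen",
}

# Hypothetical stand-in for a VideoQA model: a lookup of the answer that a
# simulated user would give about the target video for a given question key.
ANSWERS = {
    "v1": {"where": "on a stage"},
    "v2": {"where": "in a kitchen"},
    "v3": {"where": "in a kitchen"},
}


def embed(text):
    """Bag-of-words vector (an assumption; a real system would use a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query):
    """Return video ids ordered from most to least similar to the query."""
    q = embed(query)
    return sorted(VIDEOS, key=lambda vid: cosine(q, embed(VIDEOS[vid])), reverse=True)


def simulated_answer(question, target_id):
    """Answer a question about the target video (placeholder for a VideoQA model)."""
    return ANSWERS[target_id][question]


def interactive_retrieval(query, target_id, questions):
    """Track the 1-based rank of the target video after each round of interaction."""
    ranks = [rank(query).index(target_id) + 1]
    for question in questions:
        # Fold the simulated user's answer into the query, then re-rank.
        query = f"{query} {simulated_answer(question, target_id)}"
        ranks.append(rank(query).index(target_id) + 1)
    return ranks


if __name__ == "__main__":
    print(interactive_retrieval("a man plays guitar", "v2", ["where"]))
```

In this toy setting the initial query cannot separate `v1` from `v2`, and a single answered question moves the target `v2` to the top of the ranking. The sketch only shows why simulated answers allow the interactive task to be studied without recorded dialogues; it says nothing about how questions are generated or how answers are fused in the actual framework.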