The recent introduction of the large-scale, long-form MAD dataset for language grounding in videos has enabled researchers to investigate the performance of current state-of-the-art methods in the long-form setup, with unexpected findings. Current grounding methods alone fail in this challenging setup because they cannot process long video sequences. In this work, we propose an effective way to circumvent the long-form burden by introducing a new component to grounding pipelines: a Guidance model. The Guidance model efficiently removes irrelevant video segments, which we term non-describable moments, from the search space of grounding methods. It does so by coarsely aligning the sentence to chunks of the movie, so that legacy grounding methods are applied only where high correlation is found. This two-stage approach proves effective in boosting the performance of several different grounding baselines on the challenging MAD dataset, achieving new state-of-the-art performance.
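The two-stage idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method of this work: the names `guided_grounding` and `best_frame_grounder` are hypothetical, the guidance score is taken to be the cosine similarity between the sentence embedding and mean-pooled chunk features, and pruning keeps a fixed number of top-scoring chunks. The abstract does not specify how the Guidance model scores chunks or how the threshold is chosen.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Moment = Tuple[int, int, float]  # (start_frame, end_frame, score)
Grounder = Callable[[Sequence[Vector], Vector], List[Moment]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average the frame features of one chunk into a single vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def guided_grounding(
    frames: Sequence[Vector],
    query: Vector,
    ground: Grounder,
    chunk_len: int = 4,
    keep_top: int = 2,
) -> List[Moment]:
    # Stage 1 (assumed guidance): score each fixed-length chunk against the
    # sentence and keep only the keep_top most correlated chunks.
    chunks = [
        (start, min(start + chunk_len, len(frames)))
        for start in range(0, len(frames), chunk_len)
    ]
    scored = [
        (cosine(mean_pool(frames[s:e]), query), s, e) for s, e in chunks
    ]
    kept = sorted(scored, reverse=True)[:keep_top]

    # Stage 2: run the existing grounding method only inside the kept chunks
    # and map its chunk-local frame indices back to the full video.
    moments: List[Moment] = []
    for _, s, e in kept:
        for local_start, local_end, score in ground(frames[s:e], query):
            moments.append((s + local_start, s + local_end, score))
    return sorted(moments, key=lambda m: m[2], reverse=True)


def best_frame_grounder(frames: Sequence[Vector], query: Vector) -> List[Moment]:
    """Toy stand-in for a grounding method: returns the single best frame."""
    scores = [cosine(f, query) for f in frames]
    best = max(range(len(frames)), key=lambda i: scores[i])
    return [(best, best + 1, scores[best])]
```

In this sketch the grounding function never sees the discarded chunks, which is where the efficiency gain would come from. Any grounding method with the same call signature could be substituted for the toy `best_frame_grounder`.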