This paper explores Temporal Video Grounding (TVG): given an untrimmed video and a natural language sentence query, the goal is to recognize the action instances described by the query and determine their temporal boundaries in the video. Recent works have tackled this task by improving query inputs with large pre-trained language models (PLMs), at the cost of more expensive training. However, the effects of this integration are unclear, as these works also propose improvements to the visual inputs. Therefore, this paper studies the effects of PLMs in TVG and assesses the applicability of parameter-efficient training with NLP adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that, without changing the visual inputs, TVG models benefit greatly from PLM integration and fine-tuning, stressing the importance of sentence query representation in this task. Furthermore, NLP adapters are an effective alternative to full fine-tuning, even though they were not tailored to our task, allowing PLM integration in larger TVG models and delivering results comparable to state-of-the-art (SOTA) models. Finally, our results shed light on which adapters work best in different scenarios.