This paper presents SimBase, a simple yet effective baseline for temporal video grounding. While recent advances in temporal grounding have led to impressive performance, they have also driven network architectures toward greater complexity, with a range of methods to (1) capture temporal relationships and (2) achieve effective multimodal fusion. In contrast, this paper explores the question: How effective can a simplified approach be? To investigate, we design SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase employs only an element-wise product rather than intricate multimodal fusion. Remarkably, SimBase achieves state-of-the-art results on two large-scale datasets. As a simple yet powerful baseline, we hope SimBase will spark new ideas and streamline future evaluations in temporal video grounding.
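The two simplifications described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`simbase_fuse`, `temporal_conv1d`), the shared smoothing kernel, and the feature shapes are all illustrative assumptions, chosen only to show element-wise-product fusion followed by a lightweight 1-D temporal convolution.

```python
import numpy as np

def temporal_conv1d(x, kernel):
    """Depthwise 1-D convolution along the temporal axis ("same" padding).

    x: (T, D) sequence of features; kernel: (K,) filter shared across channels.
    """
    return np.stack(
        [np.convolve(x[:, d], kernel, mode="same") for d in range(x.shape[1])],
        axis=1,
    )

def simbase_fuse(video_feats, text_feat, kernel):
    """Cross-modal interaction via an element-wise product, then temporal conv.

    video_feats: (T, D) per-clip video features; text_feat: (D,) query feature.
    The product broadcasts the query over all T time steps.
    """
    fused = video_feats * text_feat  # (T, D) * (D,) -> (T, D)
    return temporal_conv1d(fused, kernel)

# Toy example: 8 clips, 4-dimensional features, a simple smoothing kernel.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4))
text = rng.standard_normal(4)
kernel = np.array([0.25, 0.5, 0.25])

out = simbase_fuse(video, text, kernel)
print(out.shape)  # (8, 4)
```

A real system would learn the convolution weights and stack several such layers, but the sketch shows why the fusion step itself carries no learned parameters: it is a single broadcasted multiplication.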