Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent approaches use the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not capture these relations appropriately, yielding distributions in which the target video moments are difficult to separate from the remaining ones. To resolve this issue, we propose an energy-based model framework that explicitly learns moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that uses an exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets show that our methods outperform state-of-the-art baselines.
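To make the idea of an exponential moving average with a damping factor concrete, the following is a minimal sketch and not the DemaFormer implementation. It assumes one common form of the damped recurrence, y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, which the abstract does not spell out, so the exact formulation in the paper may differ. The function name `damped_ema` and the scalar parameters `alpha` and `delta` are hypothetical; in a trained model these would be learnable parameters, typically one per feature dimension, rather than fixed constants.

```python
def damped_ema(xs, alpha, delta):
    """Damped exponential moving average over a 1-D sequence (illustrative only).

    Assumed recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},
    with the state initialized to 0 before the first step.

    alpha: weight on the current input, in (0, 1).
    delta: damping factor, in (0, 1]; delta = 1 recovers the standard EMA.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")

    outputs = []
    state = 0.0  # assumed zero initial state
    for x in xs:
        # Smaller delta means the previous state is carried forward more strongly.
        state = alpha * x + (1.0 - alpha * delta) * state
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    features = [1.0, 1.0, 0.0, 0.0]  # toy per-moment feature values
    print(damped_ema(features, alpha=0.5, delta=1.0))  # standard EMA
    print(damped_ema(features, alpha=0.5, delta=0.5))  # damped variant
```

In this sketch, lowering `delta` makes the running state decay more slowly, so earlier moments influence later outputs for longer. Making the factor learnable lets a model choose that decay rate from data.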