Existing approaches for video moment retrieval and highlight detection cannot align text and video features efficiently, resulting in unsatisfactory performance and limited use in production. To address this, we propose a novel architecture that leverages recent video foundation models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly improves performance on both moment retrieval and highlight detection. To push performance further, we developed InterVid-MR, a large-scale, high-quality dataset for pretraining. Pretrained on it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA, and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.