Temporal Sentence Grounding in Videos (TSGV) aims to detect the timestamps of the event described by a natural language query in untrimmed videos. This paper addresses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches improve accuracy by designing elaborate architectures with extra layers and losses, which makes them inefficient and heavy. Although some works have noticed this problem, they address only the feature fusion layers, so the speed gain is largely lost in the otherwise cumbersome network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify the different outputs of the heterogeneous models into a single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers; the KAU module leverages multi-scale video and global query information to adaptively determine the weights of different teachers. A Shared Encoder strategy is then proposed to address the problem that the student's shallow layers hardly benefit from the teachers: an isomorphic teacher is trained collaboratively with the student to align their hidden states. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.
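To make the aggregation idea concrete, the following is a minimal sketch of how soft labels from several teachers might be combined with adaptive weights and then used as a distillation target. It is not the EMTM implementation. It assumes each teacher's output has already been unified into a probability distribution over video clips (for example, for the start boundary), and it takes the per-teacher scores as given inputs, whereas the abstract says the KAU derives the weights from multi-scale video and global query information. The function names `aggregate_soft_labels` and `kl_divergence` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(teacher_probs, teacher_scores):
    """Combine K teacher distributions into one soft label.

    teacher_probs: K lists of length T, each a distribution over T clips
                   (assumed unified output form).
    teacher_scores: K raw scores; in this sketch they are supplied by the
                    caller rather than predicted by a learned module.
    """
    weights = softmax(teacher_scores)
    num_clips = len(teacher_probs[0])
    return [
        sum(w * probs[t] for w, probs in zip(weights, teacher_probs))
        for t in range(num_clips)
    ]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred), a common choice of distillation loss."""
    return sum(
        q * math.log((q + eps) / (p + eps))
        for q, p in zip(target, pred)
        if q > 0
    )


# Two teachers over three clips, with equal scores (equal weights).
teachers = [[0.1, 0.7, 0.2], [0.2, 0.5, 0.3]]
soft_label = aggregate_soft_labels(teachers, [0.0, 0.0])
student = [0.2, 0.6, 0.2]
loss = kl_divergence(soft_label, student)
```

Because the weights come from a softmax, the aggregated label remains a valid probability distribution, so it can be compared directly with the student's prediction.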