Multimodal and large language models (LLMs) have transformed how open-world knowledge is used, opening new possibilities across many tasks and applications. The video domain in particular has benefited from these capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method for video highlight detection that leverages the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our saliency pooling technique, we achieve, to the best of our knowledge, state-of-the-art performance on the QVHighlight benchmark for highlight detection.