Text-video retrieval (TVR) has advanced substantially in recent years, driven by pre-trained models and large language models (LLMs). Despite this progress, accurate matching in TVR remains challenging because of inherent disparities between the video and text modalities and irregularities in data representation. In this paper, we propose Text-Video-ProxyNet (TV-ProxyNet), a novel framework that decomposes the conventional 1-to-N relationship of TVR into N distinct 1-to-1 relationships. By replacing a single text query with a series of text proxies, TV-ProxyNet not only broadens the query scope but also makes that expansion more precise. Each text proxy is constructed through a refined iterative process controlled by two mechanisms we call the director and the dash, which regulate the proxy's direction and distance, respectively, relative to the original text query. This design both facilitates more precise semantic alignment and helps manage the disparities and noise inherent in multimodal data. Experiments on three representative text-video retrieval benchmarks, MSRVTT, DiDeMo, and ActivityNet Captions, demonstrate the effectiveness of TV-ProxyNet. The results show an improvement of 2.0% to 3.3% in R@1 over the baseline. TV-ProxyNet achieves state-of-the-art performance on MSRVTT and ActivityNet Captions, and a 2.0% improvement on DiDeMo over existing methods, validating our approach's ability to enhance semantic mapping and reduce error propensity.
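The proxy idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the proxy is taken to be the text embedding shifted along a unit-length director by a scalar dash, the director is hand-set to point from the query toward each candidate video (in the actual framework both quantities would presumably be learned), and the names `make_proxy` and `score_with_proxies` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def make_proxy(text_emb, director, dash):
    """Hypothetical proxy: move the text query a distance `dash`
    along the unit-normalized `director`."""
    unit = normalize(director)
    return [t + dash * u for t, u in zip(text_emb, unit)]


def score_with_proxies(text_emb, video_embs, directors, dashes):
    """Build one proxy per candidate video, so the 1-to-N comparison
    becomes N separate 1-to-1 comparisons."""
    scores = []
    for video, director, dash in zip(video_embs, directors, dashes):
        proxy = make_proxy(text_emb, director, dash)
        scores.append(cosine(proxy, video))
    return scores


# Toy 2-D embeddings, purely illustrative.
text = [1.0, 0.0]
videos = [[1.0, 1.0], [0.0, 1.0]]

# Assumption: each director points from the query toward its video.
directors = [[v - t for v, t in zip(video, text)] for video in videos]
dashes = [0.5, 0.5]

scores = score_with_proxies(text, videos, directors, dashes)
baseline = [cosine(text, video) for video in videos]
```

In this toy setup each proxy sits exactly `dash` away from the original query, and each proxy-to-video score is higher than the corresponding query-to-video baseline while the ranking of the two videos is unchanged. This only shows the geometry of direction and distance; it says nothing about how the real model trains or constrains the director and dash.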