Visual language tracking (VLT) has emerged as a cutting-edge research area that harnesses linguistic data to enhance algorithms with multi-modal inputs, broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. However, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fail to capture the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has made it possible to generate diverse text. This work leverages LLMs to generate varied semantic annotations (in terms of text length and granularity) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose DTVLT, a new visual language tracking benchmark with diverse texts, built on five prominent VLT and SOT benchmarks and covering three sub-tasks: short-term tracking, long-term tracking, and global instance tracking; (2) provide texts at four granularities in our benchmark, reflecting the extent and density of semantic information, and expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research; and (3) conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse texts on tracking performance, with the hope that the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results, and toolkit will be released gradually at http://videocube.aitestunion.com/.