Point tracking is a challenging task in computer vision that aims to establish point-wise correspondences across long video sequences. Recent advances have focused primarily on temporal modeling techniques that improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach that leverages language embeddings to enhance the coherence of frame-wise visual features belonging to the same object. Our proposed method, termed autogenic language embedding for visual feature enhancement, strengthens point correspondence over long-term sequences. Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network, enabling seamless adaptation to various tracking tasks without explicit text annotations. Additionally, we introduce a consistency decoder that efficiently integrates text tokens into visual features with minimal computational overhead. Through enhanced visual consistency, our approach significantly improves tracking trajectories in lengthy videos with substantial appearance variations. Extensive experiments on widely used tracking benchmarks demonstrate the superior performance of our method, showing notable gains over trackers that rely solely on visual cues.
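The abstract describes two components: a mapping network that derives text-like tokens from visual features, and a consistency decoder that fuses those tokens back into the per-frame features via attention. The following minimal numpy sketch illustrates one plausible reading of that pipeline; it is not the paper's implementation, and every name, dimension, and operation (the linear mapping, the single cross-attention layer, the residual fusion) is a hypothetical stand-in for the trained modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T frames, N tracked points per frame,
# D-dim features, K autogenic text tokens per frame.
T, N, D, K = 8, 16, 32, 4

visual = rng.standard_normal((T, N, D))  # frame-wise visual features

# "Mapping network" stand-in: pool each frame's features and project
# them into K pseudo text tokens with a (here random) linear map.
W_map = rng.standard_normal((D, K * D)) * 0.1
text_tokens = (visual.mean(axis=1) @ W_map).reshape(T, K, D)

def consistency_decoder(vis, txt):
    # Cross-attention: visual features query the text tokens,
    # and the attended result is added back as a residual.
    attn = softmax(vis @ txt.transpose(0, 2, 1) / np.sqrt(D), axis=-1)  # (T, N, K)
    return vis + attn @ txt

enhanced = consistency_decoder(visual, text_tokens)
print(enhanced.shape)  # (8, 16, 32): same shape, semantically conditioned
```

Because the text tokens are distilled from the frames themselves, no explicit text annotation is needed, which matches the "autogenic" framing in the abstract.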