Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering \textit{where} and \textit{who}. However, they often function as semantically blind observers, capable of tracing geometric paths yet oblivious to the \textit{what} and \textit{why} behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose \textbf{LLMTrack}, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples precise localization from deep understanding, employing Grounding DINO as the ``eyes'' and the LLaVA-OneVision multimodal large model as the ``brain''. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level context, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy (Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA) to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.