As multi-object tracking (MOT) tasks evolve toward more general and multi-modal scenarios, the rigid, task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit their flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer built on a pretrained detector; to mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module; and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT is highly extensible: new modalities or instructions can be integrated simply by modifying the output sequence format, without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, which achieves performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
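To make the sequence-generation formulation concrete, the sketch below shows one plausible way to serialize per-frame tracking output into an autoregressive target sequence with quantized box coordinates. The special tokens, binning scheme, and helper names here are illustrative assumptions, not the exact vocabulary or format used by AR-MOT.

```python
# Minimal sketch of serializing MOT output as an autoregressive target
# sequence, assuming a plain-text format with quantized box coordinates.
# The special tokens and binning below are assumptions for illustration.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackedObject:
    track_id: int                    # identity preserved across frames
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def quantize(value: float, num_bins: int = 1000) -> int:
    """Map a normalized coordinate to a discrete bin so it can act as a token."""
    return min(num_bins - 1, max(0, int(value * num_bins)))


def serialize_frame(frame_idx: int, objects: List[TrackedObject]) -> str:
    """Build a target sequence: '<frame> k <obj> id x1 y1 x2 y2 ... <eos>'."""
    tokens = ["<frame>", str(frame_idx)]
    for obj in objects:
        coords = [str(quantize(c)) for c in obj.box]
        tokens += ["<obj>", str(obj.track_id), *coords]
    tokens.append("<eos>")
    return " ".join(tokens)


if __name__ == "__main__":
    frame = [
        TrackedObject(track_id=1, box=(0.10, 0.20, 0.30, 0.60)),
        TrackedObject(track_id=2, box=(0.55, 0.25, 0.70, 0.65)),
    ]
    # e.g. "<frame> 0 <obj> 1 100 200 300 600 <obj> 2 550 250 700 650 <eos>"
    print(serialize_frame(0, frame))
```

Under such a scheme, extending the tracker to a new instruction or modality amounts to defining additional tokens in the output sequence, which is the extensibility argument made above.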