In multi-object tracking (MOT), recent Transformer-based end-to-end models such as MOTR have demonstrated exceptional performance on datasets such as DanceTrack. However, the computational demands of these models make training and deployment challenging. Drawing inspiration from successful models like GPT, we present MO-YOLO, an efficient and computationally frugal end-to-end MOT model. MO-YOLO integrates principles from You Only Look Once (YOLO) and RT-DETR, adopting a decoder-only approach. By combining the decoder from RT-DETR with architectural components from YOLOv8, MO-YOLO achieves high speed, shorter training times, and proficient MOT performance. On DanceTrack, MO-YOLO not only matches but surpasses MOTR's performance while running at more than twice the frame rate (MOTR: 9.5 FPS; MO-YOLO: 19.6 FPS). Furthermore, MO-YOLO requires significantly less training time and less demanding hardware than MOTR. This research introduces a promising paradigm for efficient end-to-end MOT, emphasizing both performance and resource efficiency.