This paper proposes MO-YOLO, an efficient and lightweight end-to-end model for Multi-Object Tracking (MOT). Traditional MOT methods typically involve two separate steps, object detection and object tracking, which adds computational complexity and allows errors to propagate from one stage to the next. Recent end-to-end MOT models based on Transformer architectures have demonstrated outstanding performance, but they require substantial hardware resources. MO-YOLO combines the strengths of the YOLO and RT-DETR models to build a lightweight, resource-efficient end-to-end multi-object tracking network. On the MOT17 dataset, MOTR\cite{zeng2022motr} requires 4 days of training on 8 GeForce 2080 Ti GPUs to achieve satisfactory results, whereas MO-YOLO achieves comparable performance with 12 hours of training on a single GeForce 2080 Ti GPU.