Multi-object tracking (MOT) aims to estimate the positions and identities of multiple objects over time. The prevailing approach, tracking-by-detection (TbD), first detects objects and then links the detections, which makes it simple yet effective. However, even contemporary detectors occasionally miss objects in some frames, causing trackers to stop tracking them prematurely. To tackle this issue, we propose BUSCA (meaning "to search"), a versatile framework that is compatible with any online TbD system and improves its ability to keep tracking objects missed by the detector, primarily because of occlusions. It does so without modifying past tracking results or accessing future frames, i.e., in a fully online manner. BUSCA generates proposals based on neighboring tracks, motion, and learned tokens. Using a decision Transformer that integrates multimodal visual and spatiotemporal information, it treats object-proposal association as a multiple-choice question-answering task. BUSCA is trained independently of the underlying tracker, solely on synthetic data, and requires no fine-tuning. With BUSCA, we show consistent performance improvements across five different trackers and establish a new state-of-the-art baseline across three different benchmarks. Code is available at: https://github.com/lorenzovaquero/BUSCA.
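To make the multiple-choice framing concrete, the following is a minimal sketch of how a track without a matching detection could be offered several candidate answers and assigned the most likely one. It is not the BUSCA implementation: the names `Proposal`, `build_proposals`, `pick_proposal`, and `toy_scorer` are hypothetical, the "none" option is only a stand-in for the learned tokens, and a simple IoU-based score replaces the decision Transformer.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Proposal:
    kind: str            # hypothetical labels: "motion", "neighbor", "none"
    box: Optional[Box]   # None for the "none of the above" option


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_proposals(last_box: Box, velocity: Tuple[float, float],
                    neighbor_boxes: Sequence[Box]) -> List[Proposal]:
    """Candidate answers: one from motion, one per neighboring track, one 'none'."""
    dx, dy = velocity
    moved = (last_box[0] + dx, last_box[1] + dy, last_box[2] + dx, last_box[3] + dy)
    proposals = [Proposal("motion", moved)]
    proposals += [Proposal("neighbor", b) for b in neighbor_boxes]
    proposals.append(Proposal("none", None))  # stand-in for the learned tokens
    return proposals


def pick_proposal(proposals: Sequence[Proposal],
                  score_fn: Callable[[Proposal], float]) -> Tuple[Proposal, float]:
    """Score every candidate, normalize with softmax, and return the best one."""
    logits = [score_fn(p) for p in proposals]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(proposals)), key=probs.__getitem__)
    return proposals[best], probs[best]


def toy_scorer(last_box: Box, none_logit: float = 0.2) -> Callable[[Proposal], float]:
    """Assumed scorer: IoU with the last observed box; a learned model would go here."""
    def score(p: Proposal) -> float:
        return none_logit if p.box is None else iou(last_box, p.box)
    return score


last_box = (10.0, 10.0, 50.0, 90.0)
proposals = build_proposals(last_box, velocity=(2.0, 0.0),
                            neighbor_boxes=[(200.0, 20.0, 240.0, 100.0)])
choice, prob = pick_proposal(proposals, toy_scorer(last_box))
print(choice.kind, round(prob, 3))
```

The point of the sketch is the structure, not the scoring: all candidates, including the option of selecting no box at all, compete in a single normalized decision, and only the current frame and past track state are used, which keeps the procedure online.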