The field of multi-object tracking has recently seen renewed interest in the good old paradigm of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim to extend tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information, e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a held-out set, and ii) they implicitly assume that these costs are independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
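To make the idea of a likelihood-based association cost concrete, the following is a minimal sketch and not the method described above. It swaps the deep conditional density estimator for a plain, unconditional 2D Gaussian fitted to feature pairs of correct associations, so that the two cues are scored jointly rather than as independent terms. The feature choice (IoU and a depth gap), all numeric values, and the helper names `fit_gaussian_2d`, `nll`, and `associate` are assumptions made purely for illustration.

```python
import itertools
import math


def fit_gaussian_2d(samples):
    """Fit the mean and covariance of a 2D Gaussian to (iou, depth_gap) pairs."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / n
    syy = sum((s[1] - my) ** 2 for s in samples) / n
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    return (mx, my), (sxx, sxy, syy)


def nll(x, mean, cov):
    """Negative log-likelihood of x under the joint (correlated) Gaussian."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return 0.5 * (maha + math.log(det)) + math.log(2.0 * math.pi)


def associate(cost):
    """Exhaustive minimum-cost assignment; only sensible for tiny square matrices."""
    n = len(cost)
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Made-up (iou, depth_gap) pairs observed on correct associations.
train = [(0.9, 0.1), (0.8, 0.4), (0.7, 0.3), (0.85, 0.2), (0.6, 0.7), (0.75, 0.5)]
mean, cov = fit_gaussian_2d(train)

# Made-up features for 2 tracks (rows) versus 2 detections (columns).
features = [
    [(0.8, 0.3), (0.2, 2.0)],
    [(0.1, 2.5), (0.75, 0.4)],
]
cost = [[nll(f, mean, cov) for f in row] for row in features]
matches = associate(cost)  # matches[i] is the detection index assigned to track i
print(matches)
```

Because the cost is a single joint likelihood, no per-cue weights need to be tuned by hand, and the covariance term captures the correlation between the cues. In practice the exhaustive search would typically be replaced by the Hungarian algorithm, and the Gaussian by a more expressive density model.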