One Homography is All You Need: IMM-based Joint Homography and Multiple Object State Estimation

A novel online multiple object tracking (MOT) algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. IMM-JHSE uses an initial homography estimate as its only additional 3D information, whereas other 3D MOT methods use regular 3D measurements. By jointly modelling the homography matrix and its dynamics as part of the track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Building on this, static and dynamic camera motion models are combined using an interacting multiple model (IMM) filter. To incorporate image plane information, a simple bounding box motion model predicts bounding box positions. In addition to applying an IMM to camera motion, a non-standard IMM approach is used for association only: bounding-box-based buffered IoU (BIoU) scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion, which makes IMM-JHSE robust to motion away from the ground plane. Finally, IMM-JHSE estimates process and measurement noise dynamically. IMM-JHSE improves upon related techniques, including UCMCTrack, OC-SORT, C-BIoU and ByteTrack, on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets. Using publicly available detections, IMM-JHSE outperforms almost all other 2D MOT methods on the KITTI-car dataset and is outperformed only by 3D MOT methods, some of which are offline. Compared to tracking-by-attention methods, IMM-JHSE shows similar performance on the DanceTrack dataset and outperforms them on the MOT17 dataset. The code is publicly available: https://github.com/Paulkie99/imm-jhse
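
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the IMM-JHSE code base. It shows how an image point is mapped through a homography, how a buffered IoU and a squared Mahalanobis distance are computed, and how the standard IMM model-probability update works. All function names, the buffer value of 0.3, and in particular `mixed_score` (a simple probability-weighted blend) are assumptions made for illustration; the exact mixing rule used by IMM-JHSE is not given in the abstract.

```python
import math

def apply_homography(H, u, v):
    """Map point (u, v) through a 3x3 homography H given as nested lists."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # divide out the homogeneous coordinate

def buffered_iou(a, b, buffer=0.3):
    """IoU of two (x, y, w, h) boxes after growing each side by buffer * size."""
    def grow(box):
        x, y, w, h = box
        return x - buffer * w, y - buffer * h, x + w + buffer * w, y + h + buffer * h
    ax1, ay1, ax2, ay2 = grow(a)
    bx1, by1, bx2, by2 = grow(b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mahalanobis_sq(residual, S):
    """Squared Mahalanobis distance of a 2D residual under a 2x2 covariance S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    rx, ry = residual
    # r^T S^-1 r with S^-1 = (1/det) * [[d, -b], [-c, a]]
    return (rx * (d * rx - b * ry) + ry * (-c * rx + a * ry)) / det

def imm_model_probabilities(mu, P, likelihoods):
    """Standard IMM update: mu_j is proportional to L_j * sum_i P[i][j] * mu_i."""
    n = len(mu)
    predicted = [sum(P[i][j] * mu[i] for i in range(n)) for j in range(n)]
    unnorm = [l * c for l, c in zip(likelihoods, predicted)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mixed_score(biou, maha_sq, mu):
    """Hypothetical blend (an assumption, not the paper's rule):
    mu[0] weights the image-plane BIoU, mu[1] a ground-plane similarity."""
    ground_similarity = math.exp(-0.5 * maha_sq)
    return mu[0] * biou + mu[1] * ground_similarity
```

In this sketch, two boxes that do not overlap at all can still obtain a positive buffered IoU, which is what lets image-plane association survive fast or irregular motion, while the Mahalanobis term scores agreement on the ground plane reached through the homography.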
