LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded, human-populated environments. Although integrating RGB-based Re-Identification (ReID) can in principle preserve identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, using a lightweight projection-based framework that decouples geometric and appearance modeling for mobile robots. We analyze feature extraction architectures, covering lightweight CNNs and Vision Transformers, and evaluate several multi-modal data association strategies that balance computational latency against tracking robustness. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy recovers occluded tracks without compromising overall precision, preventing identity switches and thereby maintaining continuity in human-robot interaction. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
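To make the contrast between the two association strategies concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes precomputed track-by-detection cost matrices in [0, 1], uses a simple greedy matcher in place of whatever assignment solver is actually employed, and assumes one plausible reading of "cascaded": a geometry-only first stage followed by an appearance-only second stage for leftovers. All function names, the fusion weight, and the thresholds are hypothetical.

```python
INF = float("inf")


def greedy_match(cost, max_cost):
    """Pair rows (tracks) with columns (detections) in order of increasing cost.

    Greedy matching is an assumption made to keep the sketch dependency-free.
    """
    candidates = sorted(
        (c, i, j)
        for i, row in enumerate(cost)
        for j, c in enumerate(row)
        if c <= max_cost
    )
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches


def linear_fusion_match(motion, appearance, weight=0.5, max_cost=0.6):
    """Naive fusion: one weighted sum of both costs, matched in a single pass."""
    fused = [
        [(1.0 - weight) * m + weight * a for m, a in zip(m_row, a_row)]
        for m_row, a_row in zip(motion, appearance)
    ]
    return greedy_match(fused, max_cost)


def cascaded_match(motion, appearance, max_motion=0.5, max_appearance=0.3):
    """Cascade: match on motion first, then use appearance only for leftovers."""
    first = greedy_match(motion, max_motion)
    matched_tracks = {i for i, _ in first}
    matched_dets = {j for _, j in first}
    # Mask out everything already matched so stage 2 cannot override stage 1.
    residual = [
        [
            a if i not in matched_tracks and j not in matched_dets else INF
            for j, a in enumerate(row)
        ]
        for i, row in enumerate(appearance)
    ]
    second = greedy_match(residual, max_appearance)
    return first, second


# Hypothetical costs: track 1 was occluded, so its motion cost is high everywhere,
# but its appearance still agrees with detection 1.
motion_cost = [[0.1, 0.9], [0.9, 0.9]]
appearance_cost = [[0.2, 0.8], [0.7, 0.1]]
geometric, recovered = cascaded_match(motion_cost, appearance_cost)
```

In this toy case the first stage pairs track 0 with detection 0 from geometry alone, and the second stage recovers track 1 from appearance. Because appearance never enters the first stage, noisy visual features cannot disturb matches that geometry already resolves, which is the intuition behind preferring the cascade over a single fused cost.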