To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement, which requires tracking each user consistently over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as people bouncing, occluding one another, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose track of its interlocutor mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected with the Furhat robot to capture complex social dynamics. We present a systematic evaluation that isolates detection errors from tracking logic, compares face and body tracking, and assesses the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but has opposing effects on the two modalities: it substantially improves body tracking stability, yet causes facial IDSW to spike due to sensitivity to profile angles. Ultimately, our optimized pipeline reduces IDSW by 49\%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.