In this study, we introduce AV-PedAware, a self-supervised audio-visual fusion system designed to improve dynamic pedestrian awareness for robotics applications. Pedestrian awareness is a critical requirement in many such applications, yet traditional approaches that rely on cameras and LiDARs to cover multiple views can be expensive and susceptible to changes in illumination, occlusion, and adverse weather. Our proposed solution replicates human perception for 3D pedestrian detection using low-cost audio-visual fusion. To the best of our knowledge, this study is the first attempt to employ audio-visual fusion to monitor footstep sounds for predicting the movements of nearby pedestrians. The system is trained through self-supervised learning on LiDAR-generated labels, making it a cost-effective alternative to LiDAR-based pedestrian awareness: AV-PedAware achieves results comparable to LiDAR-based systems at a fraction of the cost. By employing an attention mechanism, it can handle dynamic lighting and occlusions, overcoming the limitations of traditional LiDAR- and camera-based systems. To evaluate the approach's effectiveness, we collected a new multimodal pedestrian detection dataset and conducted experiments demonstrating that the system provides reliable 3D detection results using only audio and visual data, even under extreme visual conditions. We will release our dataset and source code online to encourage further development in the field of robotics perception systems.