Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion approaches. While most existing research focuses on autonomous driving, we investigate detection performance and robustness across diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models, BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion), and analyze their behavior under varying levels of occlusion and distance. Our results show that the fusion-based approach consistently outperforms the single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model shows the lowest overall performance and is most affected by occlusion, distance, and noise. Our findings highlight the importance of sensor fusion for robust 3D person detection, while also underscoring the need for continued research into the vulnerabilities inherent in these systems.