Robot vision often carries a heavy computational load because large images must be processed in a short amount of time. Existing solutions often reduce image quality, which can degrade downstream processing. Another approach is to generate regions of interest using computationally expensive vision algorithms. In this paper, we evaluate how audio can be used to generate regions of interest in optical images. To this end, we propose a unique attention mechanism that localizes speech sources, and we evaluate its impact on a face detection algorithm. Our results show that the attention mechanism reduces the computational load. The proposed pipeline is flexible and can be easily adapted to human-robot interaction, robot surveillance, video conferencing, or smart glasses.
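To make the idea of an audio-driven region of interest concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that a sound source localization stage already provides a direction of arrival as azimuth and elevation angles in the camera frame, that the camera follows an ideal pinhole model with a known horizontal field of view, and that a fixed-size square window is an acceptable region. The function name `doa_to_roi` and all of its parameters are hypothetical.

```python
import math


def doa_to_roi(azimuth_deg, elevation_deg, width, height, hfov_deg, roi_size):
    """Map a direction of arrival to a square ROI (x0, y0, x1, y1) in pixels.

    Assumptions (not from the paper): ideal pinhole camera, microphone array
    and camera share the same origin and orientation, azimuth is positive to
    the right and elevation is positive upward. Returns None when the source
    falls outside the image.
    """
    if abs(azimuth_deg) >= 90.0 or abs(elevation_deg) >= 90.0:
        return None  # source is beside or behind the camera

    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)

    # Focal length in pixels, derived from the horizontal field of view.
    focal = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Pinhole projection of the direction onto the image plane.
    u = width / 2.0 + focal * math.tan(az)
    v = height / 2.0 - focal * math.tan(el) / math.cos(az)

    if not (0.0 <= u < width and 0.0 <= v < height):
        return None  # source is outside the camera's field of view

    # Square window centred on the projected point, clamped to the image.
    half = roi_size / 2.0
    x0 = max(0, round(u - half))
    y0 = max(0, round(v - half))
    x1 = min(width, round(u + half))
    y1 = min(height, round(v + half))
    return x0, y0, x1, y1


def crop(image_rows, roi):
    """Crop a row-major image (list of rows) to the ROI before detection."""
    x0, y0, x1, y1 = roi
    return [row[x0:x1] for row in image_rows[y0:y1]]
```

A face detector would then run on the cropped window instead of the full frame. As a rough illustration of the potential saving, a 200 × 200 window in a 640 × 480 frame covers about 13% of the pixels; the actual reduction depends on the detector and on how the region is sized.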