In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention: the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, that is, detecting which person in the camera wearer's field of view the wearer is listening to. To tackle this new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict a heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and it outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal