Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues into incoming speech without any additional electronics. It attaches to the in-line microphone of low-cost wired earphones that plug into smartphones. We present an end-to-end neural network that processes raw audio mixtures in real time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region, and that its performance with only two microphones exceeds that of conventional 5-microphone arrays.