Target speech extraction remains difficult on compact devices: monaural neural models lack spatial evidence, and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively harder SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers instead degrade the same mixtures, with SI-SDRi of -4.82 dB and -6.08 dB, respectively, showing that the proposed learned multimodal conditioning remains effective in a regime where conventional spatial filtering fails. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
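For readers unfamiliar with the headline metric, the following is a minimal sketch of how SI-SDR and its improvement over the mixture (SI-SDRi) are commonly computed. It is not the evaluation code used for IsoNet: the function names `si_sdr` and `si_sdr_improvement`, the mean-removal step, the `eps` stabiliser, and the toy sinusoidal signals are all assumptions made for illustration. Under this definition, the reported 9.31 dB SI-SDR with a 4.85 dB improvement implies a mixture SI-SDR of about 4.46 dB.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB (illustrative helper, not the paper's code)."""
    if len(estimate) != len(target) or not target:
        raise ValueError("estimate and target must be non-empty and equal length")
    n = len(target)
    # Remove the mean of each signal (a common convention, assumed here).
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to get the optimally scaled target.
    alpha = _dot(est, tgt) / (_dot(tgt, tgt) + eps)
    projection = [alpha * x for x in tgt]
    # Everything not explained by the scaled target counts as distortion.
    residual = [e - p for e, p in zip(est, projection)]
    ratio = (_dot(projection, projection) + eps) / (_dot(residual, residual) + eps)
    return 10.0 * math.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy signals (hypothetical): a target tone plus an interfering tone.
target = [math.sin(0.10 * i) for i in range(2000)]
interferer = [math.sin(0.37 * i + 1.0) for i in range(2000)]
mixture = [t + v for t, v in zip(target, interferer)]
# Pretend a separator attenuated the interferer to a tenth of its amplitude.
estimate = [t + 0.1 * v for t, v in zip(target, interferer)]

print(f"SI-SDR (mixture):  {si_sdr(mixture, target):.2f} dB")
print(f"SI-SDR (estimate): {si_sdr(estimate, target):.2f} dB")
print(f"SI-SDRi:           {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

A negative SI-SDRi, as reported for the oracle beamformers, means the processed output matches the target less closely than the unprocessed mixture does.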