Audio-visual approaches, which supplement the audio signal with visual inputs, have laid the foundation for recent progress in speech separation. However, how best to use auditory and visual inputs together remains an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. Experiments on three speech separation benchmark datasets show that CTCNet outperforms existing AVSS methods by a clear margin while using considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. The project repository is available at https://github.com/JusperLee/CTCNet.
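The three-step cycle described above (separate bottom-up processing, thalamic fusion, feedback to both subnetworks, repeated several times) can be illustrated with a minimal pure-Python sketch. This is only a schematic of the control flow and not the CTCNet implementation: the class `ToyCTC`, its weight names, the single-layer `tanh` stand-ins for the hierarchical subnetworks, the concatenation-based fusion, the additive feedback, and the `num_cycles` parameter are all assumptions made for illustration. The actual architecture is in the project repository.

```python
import math
import random


def linear(weights, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def tanh_vec(v):
    """Element-wise tanh nonlinearity."""
    return [math.tanh(a) for a in v]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyCTC:
    """Hypothetical toy model of the cortico-thalamo-cortical loop."""

    def __init__(self, audio_dim, visual_dim, fused_dim, seed=0):
        rng = random.Random(seed)
        # Stand-ins for the auditory and visual subnetworks (assumed single layer).
        self.w_audio = rand_matrix(audio_dim, audio_dim, rng)
        self.w_visual = rand_matrix(visual_dim, visual_dim, rng)
        # Stand-in for the thalamic fusion subnetwork.
        self.w_fuse = rand_matrix(fused_dim, audio_dim + visual_dim, rng)
        # Projections that send fused information back to each modality.
        self.w_back_audio = rand_matrix(audio_dim, fused_dim, rng)
        self.w_back_visual = rand_matrix(visual_dim, fused_dim, rng)

    def forward(self, audio, visual, num_cycles=3):
        a, v = list(audio), list(visual)
        for _ in range(num_cycles):
            # Step 1: each modality is processed in its own subnetwork.
            a = tanh_vec(linear(self.w_audio, a))
            v = tanh_vec(linear(self.w_visual, v))
            # Step 2: fuse both modalities (here: concatenate, then project).
            fused = tanh_vec(linear(self.w_fuse, a + v))
            # Step 3: feed the fused signal back (here: additive modulation).
            a = [ai + fi for ai, fi in zip(a, linear(self.w_back_audio, fused))]
            v = [vi + fi for vi, fi in zip(v, linear(self.w_back_visual, fused))]
        return a, v


# Usage: toy feature vectors for one audio frame and one video frame.
model = ToyCTC(audio_dim=4, visual_dim=3, fused_dim=5, seed=0)
audio_out, visual_out = model.forward([0.1] * 4, [0.2] * 3, num_cycles=2)
print(len(audio_out), len(visual_out))
```

In this sketch the same weights are reused on every cycle, which keeps the parameter count independent of the number of repetitions. Whether CTCNet shares weights across cycles in this way is not stated in the abstract and is assumed here only for simplicity.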