Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

Supervised learning methods have proven effective at estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they generalize poorly from simulation to reality, both because simulated and real-world acoustic characteristics are mismatched and because annotated real-world data is scarce. To address this, this work proposes a self-supervised method that exploits unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask part of the signal in one channel and ask the model to reconstruct it, which lets the model learn spatial acoustic information from the unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. Fine-tuning the pre-trained spatial encoder on a small annotated dataset then allows it to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder architecture, which suits both the pretext and downstream tasks. It is designed to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results on five acoustic parameter estimation tasks, with both simulated and real-world data, show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.
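
To make the CCSR pretext task concrete, the following is a minimal sketch of how a masked two-channel training example and its reconstruction loss might be built. It is an illustration under stated assumptions, not the authors' implementation: the STFT settings, the frame-level masking, the 50% mask ratio, the mean-squared-error loss, and the helper names `stft`, `mask_one_channel` and `masked_reconstruction_loss` are all hypothetical choices that the abstract does not specify. Random noise stands in for a real recording, and a trivial copy of the other channel stands in for the encoder-decoder model.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT of a 1-D signal; returns (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def mask_one_channel(spec, channel, mask_ratio, rng):
    """Zero a random subset of time frames in one channel.

    spec: complex array of shape (channels, frames, freqs).
    Returns the masked copy and a boolean mask over frames.
    """
    n_frames = spec.shape[1]
    n_masked = int(round(mask_ratio * n_frames))
    masked_idx = rng.choice(n_frames, size=n_masked, replace=False)
    frame_mask = np.zeros(n_frames, dtype=bool)
    frame_mask[masked_idx] = True
    masked = spec.copy()
    masked[channel, frame_mask, :] = 0.0
    return masked, frame_mask


def masked_reconstruction_loss(pred, target, frame_mask):
    """Mean squared error computed over the masked frames only (assumed loss)."""
    diff = pred[frame_mask] - target[frame_mask]
    return float(np.mean(np.abs(diff) ** 2))


rng = np.random.default_rng(0)
signals = rng.standard_normal((2, 16000))  # stand-in for a two-channel recording
spec = np.stack([stft(ch) for ch in signals])

# Mask half of the frames in channel 1; channel 0 is left untouched.
masked_spec, frame_mask = mask_one_channel(spec, channel=1, mask_ratio=0.5, rng=rng)

# Placeholder "model": predict channel 1 by copying channel 0.
baseline_pred = spec[0]
loss = masked_reconstruction_loss(baseline_pred, spec[1], frame_mask)
```

In an actual training setup the placeholder prediction would be replaced by the output of the encoder-decoder network given `masked_spec`, and the loss would be minimized over many unlabeled recordings.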
