This work proposes a multi-channel masking framework built on a learnable filterbank for multi-channel source separation. The learnable filterbank is a 1D convolutional layer that transforms the raw waveform into a 2D representation. In contrast to the conventional single-channel masking method, we estimate a separate mask for each microphone channel. The estimated masks are then applied to the transformed waveform representations in a manner analogous to traditional filter-and-sum beamforming. Specifically, each mask is multiplied element-wise with the corresponding channel's 2D representation, and the masked outputs of all channels are then summed. Finally, a 1D transposed convolutional layer converts the summed masked signal back into the waveform domain. The experimental results show that our method outperforms single-channel masking with a learnable filterbank, and that it can also outperform multi-channel complex masking on the STFT complex spectrum in the STGCSEN model when the learnable filterbank maps the waveform to a higher feature dimension. The spatial response analysis further verifies that multi-channel masking in the learnable filterbank domain has spatial selectivity.
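The mask-and-sum pipeline described in the abstract can be illustrated with a minimal sketch. It is written in dependency-free Python for readability and is not the authors' implementation. The function names (`encode`, `mask_and_sum`, `decode`) are hypothetical. Sharing one filterbank across all channels and omitting any nonlinearity are assumptions. The masks are taken as given inputs, since the network that estimates them is not described here.

```python
def encode(x, filters, stride):
    """1D convolution without padding: waveform (T,) -> 2D representation (N, K).

    filters is a list of N analysis filters, each of length L.
    """
    L = len(filters[0])
    K = (len(x) - L) // stride + 1  # number of frames
    return [[sum(f[i] * x[k * stride + i] for i in range(L)) for k in range(K)]
            for f in filters]


def mask_and_sum(reps, masks):
    """Multiply each channel's representation by its own mask, then sum over channels.

    reps and masks both have shape (C, N, K); the result has shape (N, K).
    This mirrors filter-and-sum beamforming in the filterbank domain.
    """
    C, N, K = len(reps), len(reps[0]), len(reps[0][0])
    return [[sum(masks[c][n][k] * reps[c][n][k] for c in range(C))
             for k in range(K)]
            for n in range(N)]


def decode(rep, synth, stride):
    """1D transposed convolution: representation (N, K) -> waveform, via overlap-add.

    synth is a list of N synthesis filters, each of length L.
    """
    K, L = len(rep[0]), len(synth[0])
    out = [0.0] * ((K - 1) * stride + L)
    for n, basis in enumerate(synth):
        for k in range(K):
            for i in range(L):
                out[k * stride + i] += rep[n][k] * basis[i]
    return out


def separate(mix, masks, filters, synth, stride):
    """Full pipeline for one source: encode every channel, mask, sum, decode.

    mix has shape (C, T); masks has shape (C, N, K).
    """
    reps = [encode(channel, filters, stride) for channel in mix]
    return decode(mask_and_sum(reps, masks), synth, stride)
```

With identity analysis and synthesis filters, a stride equal to the filter length, and every mask entry set to `1 / C`, `separate` reduces to averaging the microphone channels, which is a convenient sanity check. In a trained model the filters would be learned, and the masks would vary over channel, feature, and frame.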