Distributed microphone arrays (DMAs) are a promising next-generation platform for speech interaction, where speech enhancement (SE) is still required to improve speech quality in noisy conditions. Existing SE methods usually first gather raw waveforms from all devices at a fusion center (FC) and then apply a multi-microphone model, incurring high bandwidth and energy costs. In this work, we propose a \emph{Compress-and-Send Network (CaSNet)} for resource-constrained DMAs, in which one microphone serves as both the FC and the reference. Each of the other devices encodes its measured raw data into a feature matrix, which is then compressed by singular value decomposition (SVD) into a more compact representation. The features received at the FC are aligned with the reference via cross-window queries, followed by neural decoding to yield spatially coherent enhanced speech. Experiments on multiple datasets show that the proposed CaSNet substantially reduces the transmitted data volume with a negligible impact on performance compared to the uncompressed case. Reproducible code is available at https://github.com/Jokejiangv/CaSNet.
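To illustrate the bandwidth saving behind SVD compression of a per-device feature matrix, the following is a minimal sketch. It assumes a generic time-by-channel feature matrix and a manually chosen truncation rank; the actual CaSNet encoder, rank selection, and transmission format are not specified here.

```python
import numpy as np

def svd_compress(features: np.ndarray, rank: int):
    """Keep the top-`rank` singular components of a (T x D) feature
    matrix; only the truncated factors need to be transmitted."""
    U, s, Vt = np.linalg.svd(features, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]

def svd_decompress(U: np.ndarray, s: np.ndarray, Vt: np.ndarray):
    """Reconstruct an approximation of the feature matrix at the FC."""
    return (U * s) @ Vt

# Hypothetical sizes: 256 time frames, 128 feature channels, rank 16.
T, D, rank = 256, 128, 16
X = np.random.randn(T, D)

U, s, Vt = svd_compress(X, rank)
X_hat = svd_decompress(U, s, Vt)

full = T * D                       # values in the uncompressed matrix
sent = U.size + s.size + Vt.size   # values actually transmitted
print(f"values sent: {sent} of {full} ({sent / full:.1%})")
```

The transmitted payload shrinks from `T*D` to roughly `rank*(T + D + 1)` values, so the saving grows as the truncation rank drops relative to `min(T, D)`; the trade-off is the approximation error introduced by discarding the smaller singular values.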