We present a method for separating speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that lets our model produce structured encodings of audio waveforms as embedding vectors, in which one part of each vector represents the speech signal and the remainder represents the environment. We achieve this by partitioning the embeddings of different input waveforms and training the model to faithfully reconstruct audio from mixed partitions, thereby ensuring that each partition encodes a separate audio attribute. As use cases, we demonstrate the separation of speech from background noise and from reverberation characteristics. Our method also allows for targeted adjustments of the audio output characteristics.
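The partition-mixing step described above can be illustrated with a minimal sketch. It is not the training code of this work: the function names, the embedding sizes, the choice to split along the last axis, and the random arrays standing in for codec encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def split_embedding(z, speech_dim):
    """Split embeddings along the last axis into (speech, environment) parts.

    The split point `speech_dim` is a hypothetical hyperparameter.
    """
    return z[..., :speech_dim], z[..., speech_dim:]

def mix_partitions(z_a, z_b, speech_dim):
    """Combine the speech partition of z_a with the environment partition of z_b."""
    speech_a, _ = split_embedding(z_a, speech_dim)
    _, env_b = split_embedding(z_b, speech_dim)
    return np.concatenate([speech_a, env_b], axis=-1)

# Random stand-ins for the encoder outputs of two input waveforms
# (assumed shape: frames x embedding dimension).
rng = np.random.default_rng(0)
frames, dim, speech_dim = 4, 8, 5
z_a = rng.normal(size=(frames, dim))  # would be the encoding of waveform A
z_b = rng.normal(size=(frames, dim))  # would be the encoding of waveform B

# Mixed embedding: speech content from A, environment from B.
z_mix = mix_partitions(z_a, z_b, speech_dim)
```

In a training loop of this kind, a decoder would presumably reconstruct audio from `z_mix`, and a reconstruction loss against a suitable target would push each partition to carry only its own attribute. The choice of target and loss is not specified here and is left open in this sketch.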