We present a study in which contrastive learning is used to extract low-dimensional representations of the acoustic environment from single-channel, reverberant speech signals. Convolving room impulse responses (RIRs) with anechoic source signals serves as a data augmentation technique that offers considerable flexibility in the design of the upstream task. We evaluate the embeddings on three downstream tasks: regression of the reverberation time (RT60), regression of the clarity index (C50), and classification of rooms as small or large. The learned representations generalize well to unseen data and achieve performance comparable to that of a fully supervised baseline.
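The RIR-convolution augmentation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how RIRs are obtained or how pairs are formed, so the toy RIR (exponentially decaying noise), the pairing rule (two different dry signals sharing one RIR form a positive pair), and the names `synthetic_rir`, `reverberate`, and `make_positive_pair` are hypothetical rather than the authors' method.

```python
import numpy as np


def synthetic_rir(rt60, fs=16000, rng=None):
    """Toy RIR (assumption): Gaussian noise whose amplitude envelope
    has decayed by 60 dB at t = rt60 seconds."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(1.5 * rt60 * fs)                       # keep 1.5 x RT60 of tail
    t = np.arange(n) / fs
    envelope = np.exp(-np.log(1000.0) * t / rt60)  # factor 1/1000 = -60 dB at rt60
    rir = rng.standard_normal(n) * envelope
    return rir / np.max(np.abs(rir))               # peak-normalize


def reverberate(dry, rir):
    """Convolve a dry (anechoic) signal with an RIR, trimmed to the dry length."""
    return np.convolve(dry, rir, mode="full")[: len(dry)]


def make_positive_pair(dry_a, dry_b, rir):
    """Hypothetical pairing rule: different source signals, same room."""
    return reverberate(dry_a, rir), reverberate(dry_b, rir)


# Stand-in data: white noise replaces anechoic speech in this sketch.
rng = np.random.default_rng(0)
fs = 16000
dry_a = rng.standard_normal(fs)
dry_b = rng.standard_normal(fs)
rir = synthetic_rir(rt60=0.5, fs=fs, rng=rng)
view_a, view_b = make_positive_pair(dry_a, dry_b, rir)
```

In such a setup, an encoder trained with a contrastive loss would be encouraged to map `view_a` and `view_b` close together, since they share the room but not the source content; views produced with a different RIR would serve as negatives. Other pairing rules are equally conceivable, which is one reading of the flexibility the abstract mentions.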