We present a novel method for extracting neural embeddings that model the background acoustics of a speech signal. The extracted embeddings are used to estimate, in a non-intrusive manner, specific parameters describing the background acoustic properties of the signal, which makes the embeddings explainable in terms of those parameters. We illustrate the value of these embeddings through clustering experiments on unseen test data, showing that the proposed embeddings achieve a mean F1 score of 95.2\% across three different tasks, significantly outperforming WavLM-based signal embeddings. We also show that the proposed method can explain the embeddings by accurately estimating 14 acoustic parameters characterizing the background acoustics, including reverberation and noise levels, overlapped speech detection, CODEC type detection, and noise type detection, with a real-time factor 17 times lower than that of an external baseline method.