Self-supervised speech models have proven effective, yet how best to use their representations across diverse tasks remains an open question. In this study, we examine Acoustic Word Embeddings (AWEs), fixed-length features derived from continuous representations, to explore their advantages in specific tasks. AWEs have previously been shown to capture acoustic discriminability. Building on this, we propose measuring the layer-wise similarity between AWEs and word embeddings to further investigate the contextual information inherent in AWEs. We also evaluate the contribution of AWEs, compared with other types of speech features, to Speech Emotion Recognition (SER). Through a comparative experiment and a layer-wise accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore the differences between AWEs and raw self-supervised representations, as well as how best to use AWEs alone and in combination with word embeddings. Our findings underscore the acoustic context conveyed by AWEs and show that, when employed appropriately, AWEs achieve highly competitive SER accuracy.