Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open-vocabulary keyword spotting depend on a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer enables the text representations derived from the text encoder to incorporate awareness of their audio projections. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of the proposed model is evaluated by assessing its performance across different word lengths and in an out-of-vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results show that, on the challenging LibriPhrase Hard dataset, the proposed approach outperforms the cross-modality correspondence detector (CMCD) method, improving the area under the curve (AUC) by 8.22% and the equal error rate (EER) by 12.56%.