Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel-frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may struggle to discriminate audio-text pairs with closely related pronunciations, owing to their limited capability to capture the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC), which help capture pronunciation variability (transitions between connected phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four datasets using a cross-attention-based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline, with improvements of 8.32% in area under the curve (AUC) and 8.69% in equal error rate (EER) on the challenging Libriphrase-hard dataset. Moreover, the proposed approach demonstrates superior performance compared to state-of-the-art UDKWS techniques.
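For reference, the standard SDC computation stacks k blocks of delta coefficients sampled P frames apart, each delta taken over a spread of d frames around its center. The sketch below is a minimal NumPy illustration of this N-d-P-k scheme; the default values (d=1, P=3, k=7) are the parameterization commonly used in speaker and language recognition, not necessarily the configuration selected in this work, and the function name is illustrative.

```python
import numpy as np

def shifted_delta_coefficients(feats, d=1, p=3, k=7):
    """Compute shifted delta coefficients (SDC) from base features.

    feats: array of shape (T, N) - T frames of N-dimensional features
           (e.g., MFCCs). Returns an array of shape (T, N * k), where
    each frame t stacks k delta vectors computed at t, t+p, ..., t+(k-1)p,
    with each delta spanning d frames: c(t+ip+d) - c(t+ip-d).
    """
    T, N = feats.shape
    # Edge-pad so deltas near the sequence boundaries are well-defined.
    pad = d + (k - 1) * p
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            c = t + pad + i * p  # center of the i-th shifted delta
            out[t, i * N:(i + 1) * N] = padded[c + d] - padded[c - d]
    return out
```

Because each output frame spans roughly d + (k - 1) * p future frames, SDC embeds a much longer temporal context than frame-level MFCCs or simple deltas, which is the property exploited here for distinguishing near-homophone keywords.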